Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale.
The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors.
INTRODUCTION
a conventional CNN:
[image] → [crop/warp] → [conv] → [layers] → [fc layers] → [output]
spatial pyramid pooling network structure:
[image] → [conv layers] → [spatial pyramid pooling] → [fc layers] → [output]
The fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.
The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). As an extension of the Bag-of-Words (BoW) model, spatial pyramid pooling is one of the most successful methods in computer vision.
We note that SPP has several remarkable properties for deep CNNs:
- SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in previous deep networks cannot.
- SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations.
- SPP can pool features extracted at variable scales thanks to the flexibility of input scales.
SPP-net also allows us to feed images with varying sizes or scales during training.
DEEP NETWORKS WITH SPATIAL PYRAMID POOLING
Convolutional Layers and Feature Maps
These pooling layers can also be considered as “convolutional”, in the sense that they are using sliding windows.
These outputs are known as feature maps; they involve not only the strength of the responses, but also their spatial positions.
The Spatial Pyramid Pooling Layer
Spatial pyramid pooling improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size.
In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
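The pooling described above can be sketched in NumPy. The pyramid levels below ({1x1, 2x2, 4x4}) and the filter count are illustrative choices, not the paper's exact configuration; the bin boundaries follow the floor/ceil scheme given in the paper's appendix.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling over a k x h x w conv feature map.

    Max-pools each of the k filter responses inside every spatial bin and
    concatenates the results into one fixed-length kM vector, where M is
    the total number of bins across all pyramid levels.
    """
    k, h, w = feature_map.shape
    outputs = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Bin boundaries proportional to the feature-map size:
                # floor for the start, ceil for the end.
                y0, y1 = (i * h) // n, -((-(i + 1) * h) // n)
                x0, x1 = (j * w) // n, -((-(j + 1) * w) // n)
                outputs.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(outputs)  # shape: (k * M,)

fmap = np.random.rand(256, 13, 13)  # k=256 filters, 13x13 conv5 map
vec = spp(fmap)                     # M = 1 + 4 + 16 = 21 bins
print(vec.shape)                    # (5376,) = 256 * 21
```

Note that the output length depends only on k and the pyramid configuration, never on h and w, which is what makes the representation fixed-length.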
The coarsest pyramid level has a single bin that covers the entire image. This global pooling operation corresponds to the traditional Bag-of-Words method.
Training the Network
Single-size training
We implement this pooling level as a sliding window pooling, where the window size is $win = \lceil a/n \rceil$ and the stride is $str = \lfloor a/n \rfloor$, with $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denoting the ceiling and floor operations. With an $l$-level pyramid, we implement $l$ such layers. The next fully-connected layer (fc6) will concatenate the $l$ outputs.
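A quick sketch of the window/stride computation for single-size training, using the conv5 output size a = 13 from the paper's setting (the pyramid levels chosen below are one of the configurations the paper reports):

```python
import math

def sliding_pool_params(a, n):
    """Window size and stride for one pyramid level in single-size training.

    For an a x a conv feature map and a pyramid level with n x n bins,
    the paper sets win = ceil(a/n) and str = floor(a/n), so that n
    sliding-window positions exactly cover the map.
    """
    return math.ceil(a / n), math.floor(a / n)

# a = 13 (13x13 conv5 feature map), 4-level pyramid {1x1, 2x2, 3x3, 6x6}:
for n in (1, 2, 3, 6):
    win, stride = sliding_pool_params(13, n)
    print(f"{n}x{n} level: win={win}, str={stride}")
```

For example, the 3x3 level gets win = ⌈13/3⌉ = 5 and str = ⌊13/3⌋ = 4, so windows start at positions 0, 4, 8 and the last one covers columns 8 through 12.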
Multi-size training
We train each full epoch on one network, and then switch to the other one (keeping all weights) for the next full epoch.
The main purpose of our multi-size training is to simulate the varying input sizes while still leveraging the existing well-optimized fixed-size implementations.
Note that the above single/multi-size solutions are for training only. At the testing stage, it is straightforward to apply SPP-net on images of any sizes.
SPP-NET FOR IMAGE CLASSIFICATION
Experiments on ImageNet 2012 Classification
Baseline Network Architectures
- ZF-5
- Convnet*-5
- Overfeat-5/7
Multi-level Pooling Improves Accuracy
It is worth noticing that the gain of multi-level pooling is not simply due to more parameters; rather, it is because the multi-level pooling is robust to the variance in object deformations and spatial layout
Multi-size Training Improves Accuracy
Full-image Representations Improve Accuracy
First, we empirically find that even for the combination of dozens of views, the additional two full-image views (with flipping) can still boost the accuracy by about 0.2%.
Second, the full-image view is methodologically consistent with the traditional methods, where the encoded SIFT vectors of the entire image are pooled together.
Third, in other applications such as image retrieval, an image representation, rather than a classification score, is required for similarity ranking. A full-image representation can be preferred.
Multi-view Testing on Feature Maps
Summary and Results for ILSVRC 2014
Experiments on VOC 2007 Classification
Experiments on Caltech101
SPP-NET FOR OBJECT DETECTION
We extract the feature maps from the entire image only once (possibly at multiple scales). Then we apply the spatial pyramid pooling on each candidate window of the feature maps to pool a fixed-length representation of this window. Because the time-consuming convolutions are only applied once, our method can run orders of magnitude faster.
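The detection pipeline above can be sketched as follows: the convolutional layers run once on the whole image, and each candidate window is then pooled directly from the shared feature maps. The box coordinates, pyramid levels, and names here are illustrative; in the paper, windows are first mapped from image coordinates to feature-map coordinates (see the appendix).

```python
import numpy as np

def pool_window(feature_map, box, levels=(1, 2, 4)):
    """Fixed-length SPP representation of one candidate window.

    `feature_map` is the k x H x W conv output of the *entire* image,
    computed once; `box` = (y0, x0, y1, x1) is the window already
    projected onto feature-map coordinates.
    """
    y0, x0, y1, x1 = box
    region = feature_map[:, y0:y1, x0:x1]
    k, h, w = region.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                ry0, ry1 = (i * h) // n, -((-(i + 1) * h) // n)
                rx0, rx1 = (j * w) // n, -((-(j + 1) * w) // n)
                out.append(region[:, ry0:ry1, rx0:rx1].max(axis=(1, 2)))
    return np.concatenate(out)

fmap = np.random.rand(256, 40, 60)       # conv features for the whole image
v1 = pool_window(fmap, (5, 10, 25, 30))  # two proposals of different sizes,
v2 = pool_window(fmap, (0, 0, 40, 20))   # pooled from one conv pass
print(v1.shape == v2.shape)              # True: both are 256 * 21 = 5376-dim
```

The speedup comes from the structure of the loop: the expensive convolutions are outside it, and only the cheap per-window pooling is repeated for each of the ~2000 proposals.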
Detection Algorithm
Detection Results
Complexity and Running Time
Model Combination for Detection
We further find that the complementarity is mainly because of the convolutional layers.
ILSVRC 2014 Detection
CONCLUSION
APPENDIX A
Mean Subtraction.
Implementation of Pooling Bins.
For a pyramid level with $n \times n$ bins, the $(i, j)$-th bin is in the range of $[\lfloor \frac{i-1}{n} w \rfloor, \lceil \frac{i}{n} w \rceil] \times [\lfloor \frac{j-1}{n} h \rfloor, \lceil \frac{j}{n} h \rceil]$.
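A small worked example of this bin-range formula (function name and the 13x13 map size are illustrative):

```python
import math

def bin_range(i, j, n, w, h):
    """Range of the (i, j)-th bin (1-indexed) in an n x n pyramid level:
    [floor((i-1)/n * w), ceil(i/n * w)] x [floor((j-1)/n * h), ceil(j/n * h)].
    """
    return ((math.floor((i - 1) * w / n), math.ceil(i * w / n)),
            (math.floor((j - 1) * h / n), math.ceil(j * h / n)))

# 3x3 level on a 13x13 feature map: the floor/ceil pairing makes adjacent
# bins overlap slightly, so every feature-map position is covered.
for i in range(1, 4):
    print(bin_range(i, 1, 3, 13, 13)[0])  # (0, 5), (4, 9), (8, 13)
```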
Mapping a Window to Feature Maps.