Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale.
The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors.
INTRODUCTION
a conventional CNN:
[image] → [crop/warp] → [conv] → [layers] → [fc layers] → [output]
spatial pyramid pooling network structure:
[image] → [conv layers] → [spatial pyramid pooling] → [fc layers] → [output]
The fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.
The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). As an extension of the Bag-of-Words (BoW) model, spatial pyramid pooling is one of the most successful methods in computer vision.
We note that SPP has several remarkable properties for deep CNNs:
- SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in previous deep networks cannot.
- SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations.
- SPP can pool features extracted at variable scales thanks to the flexibility of input scales.
SPP-net also allows us to feed images with varying sizes or scales during training.
DEEP NETWORKS WITH SPATIAL PYRAMID POOLING
Convolutional Layers and Feature Maps
These pooling layers can also be considered as “convolutional”, in the sense that they are using sliding windows.
These outputs are known as feature maps; they involve not only the strength of the responses, but also their spatial positions.
The Spatial Pyramid Pooling Layer
Spatial pyramid pooling improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size.
In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
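The pooling described above can be sketched in NumPy. The pyramid levels below ({1x1, 2x2, 4x4}) and the filter count are illustrative choices, not the paper's exact configuration; the bin boundaries follow the floor/ceil scheme given in the paper's appendix.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling over a k x h x w conv feature map.

    Max-pools each of the k filter responses inside every spatial bin and
    concatenates the results into one fixed-length kM vector, where M is
    the total number of bins across all pyramid levels.
    """
    k, h, w = feature_map.shape
    outputs = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Bin boundaries proportional to the feature-map size:
                # floor for the start, ceil for the end.
                y0, y1 = (i * h) // n, -((-(i + 1) * h) // n)
                x0, x1 = (j * w) // n, -((-(j + 1) * w) // n)
                outputs.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(outputs)  # shape: (k * M,)

fmap = np.random.rand(256, 13, 13)  # k=256 filters, 13x13 conv5 map
vec = spp(fmap)                     # M = 1 + 4 + 16 = 21 bins
print(vec.shape)                    # (5376,) = 256 * 21
```

Note that the output length depends only on k and the pyramid configuration, never on h and w, which is what makes the representation fixed-length.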
The coarsest pyramid level has a single bin that covers the entire image. This global pooling operation corresponds to the traditional Bag-of-Words method.
Training the Network
Single-size training
We implement this pooling level as a sliding window pooling, where the window size is $win = \lceil a/n \rceil$ and the stride is $str = \lfloor a/n \rfloor$, with $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denoting the ceiling and floor operations. With an $l$-level pyramid, we implement $l$ such layers. The next fully-connected layer (fc6) will concatenate the $l$ outputs.
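A quick sketch of the window/stride computation for single-size training, using the conv5 output size a = 13 from the paper's setting (the pyramid levels chosen below are one of the configurations the paper reports):

```python
import math

def sliding_pool_params(a, n):
    """Window size and stride for one pyramid level in single-size training.

    For an a x a conv feature map and a pyramid level with n x n bins,
    the paper sets win = ceil(a/n) and str = floor(a/n), so that n
    sliding-window positions exactly cover the map.
    """
    return math.ceil(a / n), math.floor(a / n)

# a = 13 (13x13 conv5 feature map), 4-level pyramid {1x1, 2x2, 3x3, 6x6}:
for n in (1, 2, 3, 6):
    win, stride = sliding_pool_params(13, n)
    print(f"{n}x{n} level: win={win}, str={stride}")
```

For example, the 3x3 level gets win = ⌈13/3⌉ = 5 and str = ⌊13/3⌋ = 4, so windows start at positions 0, 4, 8 and the last one covers columns 8 through 12.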
Multi-size training
We train each full epoch on one network, and then switch to the other one (keeping all weights) for the next full epoch.
The main purpose of our multi-size training is to simulate the varying input sizes while still leveraging the existing well-optimized fixed-size implementations.
Note that the above single/multi-size solutions are for training only. At the testing stage, it is straightforward to apply SPP-net on images of any sizes.
SPP-NET FOR IMAGE CLASSIFICATION
Experiments on ImageNet 2012 Classification
Baseline Network Architectures
- ZF-5
- Convnet*-5
- Overfeat-5/7
Multi-level Pooling Improves Accuracy
It is worth noticing that the gain of multi-level pooling is not simply due to more parameters; rather, it is because the multi-level pooling is robust to the variance in object deformations and spatial layout
Multi-size Training Improves Accuracy
Full-image Representations Improve Accuracy
First, we empirically find that even for the combination of dozens of views, the additional two full-image views (with flipping) can still boost the accuracy by about 0.2%.
Second, the full-image view is methodologically consistent with the traditional methods, where the encoded SIFT vectors of the entire image are pooled together.
Third, in other applications such as image retrieval, an image representation, rather than a classification score, is required for similarity ranking. A full-image representation can be preferred.
Multi-view Testing on Feature Maps
Summary and Results for ILSVRC 2014
Experiments on VOC 2007 Classification
Experiments on Caltech101
SPP-NET FOR OBJECT DETECTION
We extract the feature maps from the entire image only once (possibly at multiple scales). Then we apply the spatial pyramid pooling on each candidate window of the feature maps to pool a fixed-length representation of this window. Because the time-consuming convolutions are only applied once, our method can run orders of magnitude faster.
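The detection pipeline above can be sketched as follows: the convolutional layers run once on the whole image, and each candidate window is then pooled directly from the shared feature maps. The box coordinates, pyramid levels, and names here are illustrative; in the paper, windows are first mapped from image coordinates to feature-map coordinates (see the appendix).

```python
import numpy as np

def pool_window(feature_map, box, levels=(1, 2, 4)):
    """Fixed-length SPP representation of one candidate window.

    `feature_map` is the k x H x W conv output of the *entire* image,
    computed once; `box` = (y0, x0, y1, x1) is the window already
    projected onto feature-map coordinates.
    """
    y0, x0, y1, x1 = box
    region = feature_map[:, y0:y1, x0:x1]
    k, h, w = region.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                ry0, ry1 = (i * h) // n, -((-(i + 1) * h) // n)
                rx0, rx1 = (j * w) // n, -((-(j + 1) * w) // n)
                out.append(region[:, ry0:ry1, rx0:rx1].max(axis=(1, 2)))
    return np.concatenate(out)

fmap = np.random.rand(256, 40, 60)       # conv features for the whole image
v1 = pool_window(fmap, (5, 10, 25, 30))  # two proposals of different sizes,
v2 = pool_window(fmap, (0, 0, 40, 20))   # pooled from one conv pass
print(v1.shape == v2.shape)              # True: both are 256 * 21 = 5376-dim
```

The speedup comes from the structure of the loop: the expensive convolutions are outside it, and only the cheap per-window pooling is repeated for each of the ~2000 proposals.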
Detection Algorithm
Detection Results
Complexity and Running Time
Model Combination for Detection
We further find that the complementarity is mainly because of the convolutional layers.
ILSVRC 2014 Detection
CONCLUSION
APPENDIX A
Mean Subtraction.
Implementation of Pooling Bins.
For a pyramid level with $n \times n$ bins, the $(i, j)$-th bin is in the range of $[\lfloor \frac{i-1}{n} w \rfloor, \lceil \frac{i}{n} w \rceil] \times [\lfloor \frac{j-1}{n} h \rfloor, \lceil \frac{j}{n} h \rceil]$.
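A small worked example of this bin-range formula (function name and the 13x13 map size are illustrative):

```python
import math

def bin_range(i, j, n, w, h):
    """Range of the (i, j)-th bin (1-indexed) in an n x n pyramid level:
    [floor((i-1)/n * w), ceil(i/n * w)] x [floor((j-1)/n * h), ceil(j/n * h)].
    """
    return ((math.floor((i - 1) * w / n), math.ceil(i * w / n)),
            (math.floor((j - 1) * h / n), math.ceil(j * h / n)))

# 3x3 level on a 13x13 feature map: the floor/ceil pairing makes adjacent
# bins overlap slightly, so every feature-map position is covered.
for i in range(1, 4):
    print(bin_range(i, 1, 3, 13, 13)[0])  # (0, 5), (4, 9), (8, 13)
```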
Mapping a Window to Feature Maps.