Summary
0. History
Existing deep convolutional neural networks (CNNs) require a fixed-size input image (e.g. 224*224). This input is usually produced by cropping or warping the image, which causes problems: cropping may cut off part of the object, and warping introduces geometric distortion. The requirement is 'artificial' and may reduce recognition accuracy for images or sub-images of arbitrary size/scale.
1. Objective
Deep networks consist of conv layers and fc layers. The fixed-size input requirement comes only from the fc layers, which need a fixed-length input vector; the conv layers work in a sliding-window manner and place no constraint on the input size. The key point is therefore to adapt the variable-size conv output to the fixed input size of the fc layers.
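A minimal sketch of the arithmetic behind this point. The layer stack below (kernel, stride, padding values) is hypothetical and chosen only for illustration; it is not the ZF-5 configuration. It shows that the conv/pool output size follows the input size, so the flattened vector fed to the first fc layer changes length.

```python
def conv_out_size(n, kernel, stride, pad=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1


# Hypothetical stack of (kernel, stride, pad) tuples, for illustration only.
HYPOTHETICAL_LAYERS = [(7, 2, 0), (3, 2, 0), (5, 2, 0), (3, 2, 0)]


def feature_map_size(input_size, layers=HYPOTHETICAL_LAYERS):
    """Push an input side length through every layer in the stack."""
    size = input_size
    for kernel, stride, pad in layers:
        size = conv_out_size(size, kernel, stride, pad)
    return size


def flattened_length(input_size, channels=256):
    """Length of the vector an fc layer would receive without SPP."""
    side = feature_map_size(input_size)
    return channels * side * side


for side in (224, 180):
    print(side, feature_map_size(side), flattened_length(side))
```

With this stack a 224*224 input gives a 12*12 map and a 180*180 input gives a 9*9 map, so the flattened lengths differ and a single fc weight matrix cannot accept both.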
2. Model adopted
Based on the R-CNN pipeline, with ZF-5 as the baseline network. Selective search (SS) proposes the regions of interest; then, for each candidate window, a 4-level spatial pyramid pools the features.
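The pyramid pooling step can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `spp_max_pool`, the bin grids 1*1, 2*2, 3*3 and 6*6 (50 bins per channel), the bin-boundary rounding and the output ordering are all assumptions of this example.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 3, 6)):
    """Max-pool a feature map into a fixed-length vector.

    feature_map: list of channels, each a list of H rows of W numbers.
    levels: grid sizes n, each splitting the map into n*n bins
            (assumed values, see the note above).
    Returns a flat list of length channels * sum(n*n for n in levels).
    """
    pooled = []
    for channel in feature_map:
        h, w = len(channel), len(channel[0])
        for n in levels:
            for i in range(n):
                # Row range of bin i: floor at the start, ceil at the end,
                # so bins may overlap but are never empty.
                r0 = (i * h) // n
                r1 = math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0 = (j * w) // n
                    c1 = math.ceil((j + 1) * w / n)
                    pooled.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return pooled


def make_map(channels, h, w):
    """Toy feature map with distinct values, for demonstration only."""
    return [[[ch * 1000 + r * w + c for c in range(w)] for r in range(h)]
            for ch in range(channels)]


# Two windows of different spatial size give vectors of the same length.
print(len(spp_max_pool(make_map(2, 13, 13))))
print(len(spp_max_pool(make_map(2, 10, 7))))
```

Both calls return 100 values (2 channels * 50 bins), which is what lets the fc layers accept windows of any size.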
3. Key features of the system
a. properties of SPP:
- able to generate a fixed-size output vector
- use multi-level spatial bins
- pool features extracted at variable scales (variable input scales increase scale-invariance and reduce over-fitting)
b. SPP improves 4 different CNN architectures
e. run the convolutional layers only once on the entire image, then extract the features of each window by applying SPP to the feature maps
f. functions of global pooling:
- reduce model size and reduce overfitting
- used at the testing stage after the fc layers to improve accuracy
- used for weakly supervised object recognition
g. the pooling layer outputs a fixed-size vector for input images of different sizes
h. multi-size training: the input size is changed from one epoch to the next, which increases the accuracy of the system
i. multi-level pooling helps increase accuracy; this is not simply due to more parameters, but because multi-level pooling is robust to variance in object deformations and spatial layout
j. a full-image representation is preferred and increases accuracy
k. multi-view testing: resize the image and crop fixed-size views to generate a set of different views of the image (10 views: center and four corners, each also flipped; 18 views: additionally the middle of each side)
l. accuracy is further increased by combining two models with different conv layers
m. mapping a window to the feature map
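The window mapping in item m can be sketched as below. The total stride of 16 and the rounding rule (floor plus one for the left/top edge, ceil minus one for the right/bottom edge) are assumptions of this example and should be checked against the paper's appendix; the function name is hypothetical.

```python
import math


def window_to_feature_map(x0, y0, x1, y1, stride=16):
    """Project an image-space window onto feature-map coordinates.

    (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner,
    in pixels. stride is the assumed product of all conv/pool strides.
    Very small windows can yield right < left; callers should handle that.
    """
    left = x0 // stride + 1
    top = y0 // stride + 1
    right = math.ceil(x1 / stride) - 1
    bottom = math.ceil(y1 / stride) - 1
    return left, top, right, bottom


print(window_to_feature_map(32, 48, 160, 200))  # (3, 4, 9, 12)
```

The resulting feature-map rectangle is what SPP pools over for that candidate window, so the conv layers never have to be re-run per window.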
4. Disadvantages
Still uses a multi-stage pipeline
5. Personal review
Keep going...