PART I. RCNN
Paper link:
http://arxiv.org/pdf/1311.2524v5.pdf
Github link:
https://github.com/rbgirshick/rcnn
Detailed notes:
http://zhangliliang.com/2014/07/23/paper-note-rcnn/
(Covers CNN training; the definition of pos/neg samples; performance when different layers are used as features; bounding box regression, etc.)
Notes:
1. architecture
2. advantages & disadvantages:
a. Advantage: features are extracted by a CNN instead of by traditional hand-engineered feature methods.
b. Disadvantage: around 2000 proposals per image are each fed into the CNN ==> computationally costly ==> motivates SPP net
PART II. SPP net (spatial pyramid pooling)
Paper link:
http://arxiv.org/pdf/1406.4729v4.pdf
Github link:
https://github.com/ShaoqingRen/SPP_net
Detailed notes:
http://zhangliliang.com/2014/09/13/paper-note-sppnet/
Notes:
1. Motivation: a CNN requires a fixed input size ⇒ cropping/warping the image to fit leads to information loss ⇒ in fact, only the FC layers need a fixed-size input ⇒ construct an SPP layer that transforms conv outputs of various sizes into an FC input of one fixed size ⇒ application to RCNN: run the conv layers once per image and share them across all proposals to reduce cost.
2. Architecture:
Input the whole image ⇒ conv layers produce the feature maps (256 channels) ⇒ project the proposal regions onto the feature map (how? discussed in the detailed notes) ⇒ SPP layer: for each proposal, pool with different window sizes to get 4x4, 2x2 and 1x1 outputs (3 pyramid levels) and concatenate them into one vector (16+4+1 = 21 bins per channel; how to calculate the window size and stride? see paper section 2.3) ⇒ FC + SVM + regression
3. Advantages: faster than RCNN (24x); the multiple pyramid levels extract information at different scales from the image, which gives higher accuracy.
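A minimal NumPy sketch of the SPP layer from the architecture notes above, assuming the input is a (C, H, W) feature map that has already been cropped to one proposal. The names spp_level and spp are made up for illustration, max pooling is assumed, and the window/stride rule (window = ceil(a/n), stride = floor(a/n)) is the one referred to in paper section 2.3. It is not the code from the SPP_net repository.

```python
import math
import numpy as np


def spp_level(feature_map, n):
    """Max-pool a (C, H, W) feature map into an n x n grid of bins.

    Per axis of length a: window = ceil(a / n), stride = floor(a / n).
    Assumes H >= n and W >= n so the stride is at least 1.
    """
    c, h, w = feature_map.shape
    win_h, win_w = math.ceil(h / n), math.ceil(w / n)
    stride_h, stride_w = h // n, w // n
    out = np.empty((c, n, n), dtype=feature_map.dtype)
    for i in range(n):
        for j in range(n):
            top, left = i * stride_h, j * stride_w
            window = feature_map[:, top:top + win_h, left:left + win_w]
            out[:, i, j] = window.max(axis=(1, 2))
    return out


def spp(feature_map, levels=(4, 2, 1)):
    """Concatenate all pyramid levels into one fixed-length vector."""
    return np.concatenate(
        [spp_level(feature_map, n).reshape(-1) for n in levels]
    )


# Two proposals of different spatial size give vectors of the same length.
vec_a = spp(np.random.rand(256, 13, 13))
vec_b = spp(np.random.rand(256, 10, 17))
print(vec_a.shape, vec_b.shape)
```

With 256 channels the output always has 21 * 256 = 5376 values, whatever H and W are, which is what lets the FC layers accept proposals of any size.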
PART III. FAST RCNN
Notes:
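- Motivation: bring the shared conv computation of SPP into RCNN (RoI pooling); fold the classifier (SVM in RCNN) and bbox regression into the network so that they are trained jointly
- Architecture:
RoI pooling layer: a single-level SPP layer.
Multi-task loss layer: combines a classification loss (Lcls) and a regression loss (Lloc), where u is the true class, v is the true regression target, p is the predicted probability vector, and t = [deltax, deltay, width, height] is the predicted box offset.
- Classification loss (Lcls): softmax loss over N+1 classes (the extra 1 is background)
- Regression loss (Lloc): 4*N regression outputs (for each class: deltax, deltay, width and height)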
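A small pure-Python sketch of the multi-task loss described above. As far as I recall from the Fast RCNN paper, the loss has the form L = Lcls(p, u) + lambda * [u >= 1] * Lloc(t_u, v), with Lcls the log loss on the true class and Lloc a smooth L1 loss over the four box offsets. The bracket term and lambda come from the paper, not from these notes. The function names, the list layout of t (index 0 is an unused background slot) and the default lam=1.0 are assumptions made for illustration.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(p, u, t, v, lam=1.0):
    """L = Lcls(p, u) + lam * [u >= 1] * Lloc(t[u], v).

    p:   softmax probabilities over N+1 classes (index 0 = background)
    u:   true class index
    t:   per-class predicted offsets, t[k] = (deltax, deltay, width, height);
         t[0] is an unused placeholder (assumed layout)
    v:   true regression target, same 4-tuple layout; ignored for background
    lam: balancing weight (assumed default)
    """
    l_cls = -math.log(p[u])
    if u == 0:
        # Background RoIs have no box target, so only Lcls applies.
        return l_cls
    l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t[u], v))
    return l_cls + lam * l_loc


p = [0.1, 0.7, 0.2]
t = [None, (0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
v = (0.0, 0.0, 0.0, 0.0)
print(multitask_loss(p, 1, t, v))  # foreground: Lcls + Lloc
print(multitask_loss(p, 0, t, v))  # background: Lcls only
```

Only the regressor of the true class contributes to Lloc, which is why the network can emit 4*N outputs and still be trained against a single target v per RoI.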
PART IV. FASTER RCNN
Github link:
https://github.com/ShaoqingRen/faster_rcnn
Notes:
1. Motivation: fast RCNN still generates proposals in a pipeline separate from the CNN that computes the feature maps ⇒ proposals can instead be produced by a CNN (Region Proposal Network, RPN)
2. Architecture of RPN
Whole image fed into a CNN (any standard network) ⇒ last conv layer ⇒ 3x3 sliding window that looks at each pixel's neighbourhood; each sliding position has 9 anchors (3 ratios * 3 sizes), i.e. 9 proposals ⇒ W*H*9 anchors altogether ⇒ 1x1 conv (looks at information across channels) to get a vector ⇒ cls and reg loss layers
Multitask loss layer:
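For reference, the RPN multi-task loss as I recall it from the Faster RCNN paper. All symbols below come from the paper, not from these notes, so treat this as a sketch to check against the original:

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^{*} \, L_{reg}(t_i, t_i^{*})
```

Here i indexes the anchors in a mini-batch, p_i is the predicted probability that anchor i contains an object, p_i^* is 1 for a positive anchor and 0 otherwise, t_i and t_i^* are the predicted and ground-truth box offsets, and the factor p_i^* switches the regression term off for negative anchors.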
3. Training process of RPN and fast RCNN
phase 1: train the RPN ⇒ get proposals ⇒ feed them to fast RCNN and train it
phase 2: copy the fast RCNN conv weights into the RPN conv layers, keep the learning rate of the RCNN & RPN conv layers at 0, and train only the FC and loss layers of the RPN ⇒ feed the new proposals to fast RCNN and train only its FC layers
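The anchor layout from the RPN architecture notes above (3 ratios * 3 sizes at every sliding position, W*H*9 anchors in total) can be sketched in plain Python. The function name make_anchors is made up; the stride of 16, the sizes 128/256/512 and the ratios 0.5/1/2 are what I recall the paper using and are assumptions here, as is centring each anchor on the middle of its feature-map cell.

```python
import math


def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes in image coordinates.

    Every feature-map position gets one anchor per (scale, ratio) pair.
    Each anchor has area scale**2 and height / width equal to ratio.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this feature-map cell in image pixels (assumed).
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append(
                        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                    )
    return anchors


anchors = make_anchors(feat_w=4, feat_h=3)
print(len(anchors))  # W * H * 9
```

Anchors that stick out past the image border are not filtered in this sketch; how to treat them is a separate training detail.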