The following notes were written after reading "Rich feature hierarchies for accurate object detection and semantic segmentation"; some parts will be covered in more detail in a later note on Fast R-CNN.
Object Detection
Find all objects of interest in an image and determine their locations and categories.
Two key insights of the approach
- “Recognition using regions” paradigm: apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects, solving the CNN localization problem.
- An effective paradigm for training CNNs when annotated detection data is scarce: supervised pre-training on an auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL VOC).
Algorithm flow
1. Selective Search (SS) generates around 2k bottom-up, category-independent region proposals for each input image.
2. Warp each candidate region into the fixed-size input form the CNN expects.
3. Forward-propagate each warped region through the CNN to extract a fixed-length feature vector.
4. Score each extracted feature vector with the trained class-specific SVMs.
5. Apply NMS per class to reject regions whose IoU overlap with a higher-scoring selected region exceeds a learned threshold.
6. Apply the trained class-specific bounding-box regressors to refine the remaining bounding boxes.
Selective Search
Generating possible object locations for use in object recognition.
Advantage:
- Capture all scales and try to find all the objects: Objects can occur at any scale within the image.
- Diversification to generate high quality proposals: There is no single optimal strategy to group regions together.
- Fast to compute
Process:
Input: image || Output: L, set of object location hypotheses
refer to: https://blog.csdn.net/weixin_43694096/article/details/121610856
Object proposal transformation
In short: dilate the box by 16 pixels of context, then resize directly. Regardless of the size or aspect ratio of the candidate region, warp all pixels in a tight bounding box around it to the required size. Prior to warping, dilate the tight bounding box so that at the warped size there are exactly p (padding = 16) pixels of warped image context around the original box.
refer to: https://www.jianshu.com/p/3a0a0e5a26a1
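The dilation step can be sketched as follows. This is a minimal sketch that only computes the dilated crop rectangle; clipping to image bounds and the actual crop-and-resize (e.g. with PIL) are omitted, and the function name is my own:

```python
def dilate_box_for_warp(box, padding=16, out_size=227):
    """Return the crop rectangle that, when warped to out_size x out_size,
    leaves exactly `padding` pixels of image context on each side of the
    original tight box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # After warping, the original box occupies (out_size - 2*padding) pixels,
    # so one warped pixel corresponds to w / (out_size - 2*padding) original
    # pixels; `padding` warped pixels of context maps back accordingly.
    ctx_x = padding * w / (out_size - 2 * padding)
    ctx_y = padding * h / (out_size - 2 * padding)
    return (x1 - ctx_x, y1 - ctx_y, x2 + ctx_x, y2 + ctx_y)
```

For a 195 x 195 tight box the context is exactly 16 original pixels per side, since 227 - 2*16 = 195 means the warp is identity-scale.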
IoU
Intersection over Union measures the overlap between a ground-truth box and a predicted box.
refer to: https://blog.csdn.net/u014061630/article/details/82818112
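A minimal IoU implementation for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```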
NMS
Non-maximum suppression is applied for each class independently: sort the candidate boxes by the scores from the corresponding class-specific SVM (scores should be above a threshold to ensure the candidate box contains part of an object of that category), then reject any region whose IoU overlap with a higher-scoring box exceeds a learned threshold.
Process: suppose there are 6 candidate boxes, and sorting them by SVM score in ascending order gives A, B, C, D, E, F. Start from F, the highest scoring: compute the IoU overlap of each of A–E with F. Suppose the overlaps of B and D with F exceed the threshold; reject B and D and mark F as kept. Then repeat the same procedure on A, C, and E, continuing until all boxes to be kept have been found.
refer to: https://blog.csdn.net/shuzfan/article/details/52711706
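The procedure above can be sketched as a greedy per-class NMS (the IoU helper is repeated to keep the sketch self-contained):

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    inter = max(0, min(a[2], b[2]) - max(a[0], b[0])) * \
            max(0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[2] - a[0]) * (a[3] - a[1]) + \
            (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box and
    reject every remaining box whose IoU with it exceeds iou_threshold.
    Returns the indices of the kept boxes in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = [i for i in rest if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

In the example above, F is kept first and B, D are dropped in the first pass; the loop then continues with A, C, E.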
Feature extractor training
- Pre-training: the model learns to recognize 1000 object categories, i.e. the convolutional layers acquire a good feature-extraction ability. (Input: ILSVRC2012 || Output: 1000-dimensional category scores)
- Fine Tuning: Adapt CNN to warped proposal windows
Do: replace the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (N object classes plus background)
Dataset: PASCAL VOC2007
Input: 227 × 227 warped region proposals; those with ≥ 0.5 IoU overlap with a ground-truth box are positives (labeled with the matched ground-truth class) and the rest are negatives (labeled as background)
Output: 21-dimensional category scores (20 VOC classes plus background)
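The IoU-based labeling rule for fine-tuning can be sketched as follows (function names are my own; class 0 denotes background, and the IoU helper is repeated for self-containment):

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    inter = max(0, min(a[2], b[2]) - max(a[0], b[0])) * \
            max(0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[2] - a[0]) * (a[3] - a[1]) + \
            (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_for_finetuning(proposals, gt_boxes, gt_classes, pos_iou=0.5):
    """Assign each proposal the class of its best-overlapping ground-truth
    box when that IoU is >= pos_iou, otherwise 0 (background).
    Assumes gt_boxes is non-empty."""
    labels = []
    for p in proposals:
        overlaps = [iou(p, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
        labels.append(gt_classes[best] if overlaps[best] >= pos_iou else 0)
    return labels
```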
Class-specific linear SVMs training
Positive examples: only the ground-truth boxes for their respective classes
Negative examples: proposals with less than 0.3 IoU overlap with all instances of a class for that class
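Under this definition, for a single class the SVM training set can be constructed roughly as below; proposals with IoU ≥ 0.3 against some instance that are not ground-truth boxes fall into neither set and are ignored (the helper names are my own):

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    inter = max(0, min(a[2], b[2]) - max(a[0], b[0])) * \
            max(0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[2] - a[0]) * (a[3] - a[1]) + \
            (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def svm_examples_for_class(proposals, gt_boxes, neg_iou=0.3):
    """Positives: the ground-truth boxes of this class themselves.
    Negatives: proposals overlapping every ground-truth instance of this
    class by less than neg_iou."""
    positives = list(gt_boxes)
    negatives = [p for p in proposals
                 if all(iou(p, g) < neg_iou for g in gt_boxes)]
    return positives, negatives
```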
Class-specific bounding-box regressors training
For each category, a class-specific linear regressor is trained to refine the predicted bounding box.★★★
refer to: https://www.cnblogs.com/oliyoung/p/Bounding-Box-Regression.htm
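The regressor uses the center/log-size parameterization from the paper: given a proposal P and its matched ground truth G, it learns targets t = (tx, ty, tw, th). A minimal sketch of computing those targets:

```python
import math

def bbox_regression_targets(p, g):
    """Regression targets (tx, ty, tw, th) mapping proposal P onto ground
    truth G, with boxes given as (x1, y1, x2, y2): the center offsets are
    normalized by the proposal size, the scale changes are log-ratios."""
    pw, ph = p[2] - p[0], p[3] - p[1]
    gw, gh = g[2] - g[0], g[3] - g[1]
    px, py = p[0] + pw / 2, p[1] + ph / 2   # proposal center
    gx, gy = g[0] + gw / 2, g[1] + gh / 2   # ground-truth center
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = math.log(gw / pw)
    th = math.log(gh / ph)
    return tx, ty, tw, th
```

At test time the learned regressor predicts t from the pool5 features and the inverse transform is applied to the proposal to obtain the refined box.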
Ablation studies
- Performance layer-by-layer, without fine-tuning(ILSVRC 2012): Removing both fc6 and fc7 produces quite good results. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.
- Performance layer-by-layer, with fine-tuning(VOC 2007): The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them. (Conv for general features, fc for domain-specific tasks)
Weakness of R-CNN
- Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
- Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
- Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image on a GPU.