[Deep Learning Paper Notes][Object Detection] Rich feature hierarchies for accurate object detection and semantic segmentation

Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. (Citations: 1977).


1 Motivation

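Treat detection as classification: run a classifier on candidate windows and decide whether each one contains an object. Done naively, this means testing many positions and scales, which is too expensive when the classifier is a computationally demanding CNN. Therefore, we only look at a tiny subset of the possible positions, selected by a region proposal method.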


Region proposal methods act like class-agnostic object detectors: they find “blobby” image regions that are likely to contain objects. For example, selective search performs bottom-up segmentation. It starts from raw image pixels and, at each step, merges adjacent regions that have similar texture and color, so that blob-like regions form. At each scale, the regions are converted to bounding boxes.
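The bottom-up merging can be illustrated with a toy sketch. This is not the actual selective search algorithm: it assumes a tiny grayscale image, uses the difference of mean intensities as the only similarity cue, and the function name `toy_region_proposals` is made up for illustration.

```python
import numpy as np

def toy_region_proposals(image):
    """Greedy bottom-up grouping on a small 2-D grayscale array.

    Toy stand-in for selective search: similarity here is only the
    difference of mean intensities between two touching regions.
    Returns a set of boxes (x1, y1, x2, y2) in inclusive pixel coordinates.
    """
    h, w = image.shape
    # Every pixel starts as its own region: region id -> list of (y, x).
    regions = {y * w + x: [(y, x)] for y in range(h) for x in range(w)}
    label = np.arange(h * w).reshape(h, w)  # pixel -> current region id
    boxes = set()
    next_id = h * w

    def mean(rid):
        return float(np.mean([image[p] for p in regions[rid]]))

    while len(regions) > 1:
        # Collect all pairs of regions that touch (4-connectivity).
        pairs = set()
        for y in range(h):
            for x in range(w):
                for dy, dx in ((0, 1), (1, 0)):
                    yy, xx = y + dy, x + dx
                    if yy < h and xx < w and label[y, x] != label[yy, xx]:
                        a, b = int(label[y, x]), int(label[yy, xx])
                        pairs.add((min(a, b), max(a, b)))
        # Merge the most similar adjacent pair into a new region.
        a, b = min(pairs, key=lambda p: abs(mean(p[0]) - mean(p[1])))
        merged = regions.pop(a) + regions.pop(b)
        regions[next_id] = merged
        for (y, x) in merged:
            label[y, x] = next_id
        # Every region formed along the way contributes one bounding box.
        ys, xs = zip(*merged)
        boxes.add((min(xs), min(ys), max(xs), max(ys)))
        next_id += 1
    return boxes

# A dark block next to a bright block: each block forms first, then both merge.
img = np.array([[0, 0, 10, 10],
                [0, 0, 10, 10]], dtype=float)
proposals = toy_region_proposals(img)
```

Because boxes are recorded at every merge, the output mixes small and large regions, which is the sense in which proposals are produced at several scales.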


2 Pipeline


See Fig. The pipeline is as follows:
1. Fine-tuning. Download an ImageNet pre-trained model and discard its final fc layer. Reinitialize that layer from scratch with 20 + 1 outputs (1 for background) and continue training the model on positive/negative regions from the detection images. A region is positive if it has ≥ 0.5 IoU overlap with a ground-truth box; otherwise it is negative.
2. Extract features. Extract region proposals for all images (about 2k category-independent proposals per image). Then warp each proposal to the CNN input size, run a forward pass through the CNN, and save the fc7 features to disk.
3. Classification. Train one binary linear SVM per class to classify region features.
4. Bounding box regression. For each class, train a linear regression model that maps the cached features to offsets toward the ground-truth boxes, to correct “slightly wrong” proposals.
5. NMS. Given all scored regions in an image, apply greedy non-maximum suppression (for each class independently): a region is rejected if its IoU overlap with a higher-scoring selected region is larger than a learned threshold.
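The box arithmetic behind steps 1, 4 and 5 can be sketched in a few small functions. This is a minimal illustration, not the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, the IoU thresholds are passed in by the caller, and the default NMS threshold is only a placeholder for the learned value. The regression targets use scale-normalized center shifts and log size ratios, as in the paper.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_positive(proposal, gt_boxes, thresh):
    """Step 1: positive if the proposal overlaps some ground-truth box enough."""
    return any(iou(proposal, gt) >= thresh for gt in gt_boxes)

def regression_targets(proposal, gt):
    """Step 4: offsets (tx, ty, tw, th) from a proposal to its ground-truth box."""
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    px, py = proposal[0] + pw / 2, proposal[1] + ph / 2
    gx, gy = gt[0] + gw / 2, gt[1] + gh / 2
    # Center shifts are relative to the proposal size; sizes are log ratios.
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def nms(boxes, scores, thresh=0.3):
    """Step 5: greedy NMS for one class; returns indices of the kept boxes.

    thresh=0.3 is an assumed placeholder, not a value taken from these notes.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Running NMS per class means `boxes` and `scores` would hold only the detections scored for one class at a time.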

3 Function of Conv and FC Layers
Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.


The pool5 features learned from ImageNet are general, and most of the improvement is gained from learning domain-specific fc layers on top of them.
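To give a sense of how much larger the densely connected layers are, the following sketch counts parameters. It assumes AlexNet-style layer shapes (five convolutional layers, some with grouped filters, followed by fc6, fc7 and a 1000-way fc8); these shapes are an assumption, since the notes do not spell out the architecture.

```python
# Assumed AlexNet-style shapes, not taken from these notes.
# Convolutional layers: (num_filters, input_channels_per_filter, kernel_h, kernel_w)
CONV_LAYERS = [
    (96, 3, 11, 11),
    (256, 48, 5, 5),    # grouped convolution: each filter sees 48 channels
    (384, 256, 3, 3),
    (384, 192, 3, 3),   # grouped
    (256, 192, 3, 3),   # grouped
]
# Fully connected layers: (inputs, outputs); 9216 = 6 * 6 * 256 pool5 activations
FC_LAYERS = [(9216, 4096), (4096, 4096), (4096, 1000)]

def count_conv(layers):
    """Weights plus one bias per filter."""
    return sum(n * c * kh * kw + n for n, c, kh, kw in layers)

def count_fc(layers):
    """Weights plus one bias per output unit."""
    return sum(i * o + o for i, o in layers)

conv_params = count_conv(CONV_LAYERS)
fc_params = count_fc(FC_LAYERS)
conv_share = conv_params / (conv_params + fc_params)
print(f"conv: {conv_params:,}  fc: {fc_params:,}  conv share: {conv_share:.1%}")
```

Under these assumed shapes the convolutional layers account for well under a tenth of all parameters, even though pool5 features already carry much of the representational power.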

4 Result
mAP on the VOC-07 dataset:
• R-CNN: 54.2%.
• R-CNN (with bounding box regression): 58.5%.
• R-CNN (using VGG-16 features): 66%.

