论文笔记|Rich feature hierarchies for accurate object detection and semantic segmentation

Authors

Ross Girshick /Jeff Donahue/Trevor Darrell /Jitendra Malik
Ross Girshick
Ross Girshick

Abstract

R-CNN:Regions with CNN features. It combines two key insights:
1. apply cnns to bottom-up region proposals
2. supervised pre-training for an anuxiliary task(fine-tuning)

1 Introduction

1.1 from hog->cnns

The last decade of progress on various visual recognition tasks has been based on SIFT and HOG,which we could associate them with complex cells in V1, but they still perform poorly. we need more multi-stage processes for computing features.
Fukushinma–neocognitron– lacked a supervised training algorithm
LeCun et al. – SGD+backpropagation was effective for training CNNs
CNNs saw heavy use in 1990s, but then fell out of fashion with the rise of SVM, Krizhevsky et al. rekindled interest in CNNs,in ILSVRC 2012(rectifying non-linearities and “dropout” regularization) .
This paper is the first to show that a cnns can lead to dramatically higher object detection performance on PASCAL VOC than HOG-like features.

1.2 two problems

this paper focused on two problems: localizing objects with a deep network and training a high-capacity model with a small quantity of annotated data.
1. localization mathod
- as a regression problem
- sliding-window (loss precision) (overFeat)
2. fine-tuning

1.3 efficient

1.4 dominant error mode

a simple bounding-box regression method significantly reduce mislocalizations

2 Object detection with R-CNN

this system consists of three modules.
1. generates category-independent region proposals
2. extracts a fixed-length feature vector from region using cnn
3. linear SVMs

2.1 module design

2.1.1 region proposals

some examples: 1objectness, 2selective search, 3category-independent object proposals,4consitrained parametric min-cuts(CPMC),5multi-scale combinatorial grouping, 6ciresan….
we use selective search to enable a controlled comparsion with prior detection work.

J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders.Selective search for object recognition. IJCV, 2013.

2.1.2 Feature extraction

Warp all pixels in a tight bounding box around it to the required size(227x227), prior to warping ,we dilate the box p(16) pixels.

2.2 test-time detection

This paper run selective search on the test image to extract around 2000 region proposals(fast mode), Given all scored regions (SVM) in an image, we apply a greedy non-maximum suppression that reject a region if it has an intersection over union**(IoU)** overlap with a higher socring selected region larger than a learned threshold.

2.2.1 run-time analysis

  1. cnn share parameters across all categories( and easier for different num_category)
  2. feature vectors computed by cnn are lower dimensional

13s/image on a gpu, 53s/image on a cup
feature matrix is 2000x4096, svm weight maxtrix is 4096xnumber of classes.

2.3 training

2.3.1 supervised pre-treining

caffe cnn library (nearly matches the performance of Krizhevsk et al.)

2.3.2 domain-specific fine-tuning

replacing the 1000way classification layer with a randomly initialized (N+1)-way classification layer.(plus 1 for background)
We treat all region proposals with >=0.5 IoU overlap with a ground-truth box as positives .
we start SGD at a learning rate of 0.001(1/10th of theinitial pre-training rate, not clobbering the initialization )
In each SGD iteration, we uniformly sample 32 positive windows + 96 background windows to construct a mini-batch of size 128.

2.3.2 Object category classifiers

for a background region, it is easy
but how to label a region taht partially overlaps a car. We resolve this issue with an IoU overlap threshold 0.3 (grid search over 0,0.1,0.2,…0.5, and defined differently in fine-tuning),below are defined as negatives.
Since the training data is too large to fit in memory, we adopt the standard hard negative mining method.

2.4 Results on PASCAL VOC 2010-2012

2010 53.7% 5011-2012 53.3% mAP
UVA same region proposal algorithm + four level spatial pyramid SIFT +nonlinear kernel svm–>35.1%

2.5 Results on ILSVRC2013

31.4%
OverFeat 24.3%

3 Visualization,ablation,and modes of error

3.1 Visualizing learned features

The idea is to single out a particular unit (feature) in the network and ust it as if it were an object detector in its own right.
The follow picture showing top regions for six pool5 units. each pool5 unit has a recptive field of 195x195.
这里写图片描述

3.2 Ablation studies

3.2.1 Performance layer-by-layer, without fine-tuning.

  1. Features from fc7 generalize worse than features from fc6, this means 29% of the parameters can be removed without degrading mAP.
  2. Pool4 features are computed using only6% parameters, but it can produces quite good results.
  3. 1,2–> Much of the CNN’s representational power comes from its convlutional layers .
  4. 3—> this finding suggests potential utility in computing a dense feature map (HOG-like) by using only the convolutional layers of CNN. This representaion would enable exprimentation with sliding-window detectors on top of pool5 features.
    这里写图片描述

3.2.2 Performance layer-by-layer, with fine-tuning.

  1. The boost from fine-tuning is much larger for fc6 and fc7.
  2. 1–>suggests that the pool5 features learned from Image Net are general and that most of the imporvement is gained from l**earning domain-specific non-linear classifiers on top of them.**

3.2.3 comparision to recent feature learning methods

table2 rows8-10

3.3 Network architectures

We have found that the choice of architecure has a large effect on R-CNN detection performance.
1. O-net (13layers of 3x3 convs and 5 pooling layers) outperforms T-net
2. a considerable drawback :7 times longer (forward) than T-net

3.4 Detection error analysis

CNN features are much more discriminative than HOG, loose loaclization likely resluts from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image calssification.
这里写图片描述

D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error
in object detectors. In
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值