Fast R-CNN

Introduction

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges:

  • First, numerous candidate object locations (often called “proposals”) must be processed.
  • Second, these candidates provide only rough localization that must be refined to achieve precise localization.

We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

R-CNN and SPPnet

Both prior methods have notable drawbacks:

  1. Training is a multi-stage pipeline.
  2. Training is expensive in space and time.
  3. Object detection is slow.

Contributions

  1. Higher detection quality (mAP) than R-CNN and SPPnet
  2. Training is single-stage, using a multi-task loss
  3. Training can update all network layers
  4. No disk storage is required for feature caching

Fast R-CNN architecture and training

A Fast R-CNN network takes as input an entire image and a set of object proposals.

The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.

Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.

Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:

  • one that produces softmax probability estimates over K object classes plus a catch-all “background” class
  • another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
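
To make this two-branch design concrete, here is a minimal PyTorch-style sketch; the class name `FastRCNNHead`, the VGG16-like layer sizes, and the 1/16 feature stride are illustrative assumptions, not code from the paper:

```python
import torch.nn as nn
from torchvision.ops import roi_pool  # single-level RoI max pooling

class FastRCNNHead(nn.Module):
    """Illustrative sketch: RoI pooling -> fc layers -> two sibling outputs."""
    def __init__(self, in_channels=512, roi_size=7, num_classes=20):
        super().__init__()
        self.roi_size = roi_size
        # Sequence of fully connected layers (VGG16-like sizes, assumed).
        self.fcs = nn.Sequential(
            nn.Linear(in_channels * roi_size * roi_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # Sibling layer 1: scores over K object classes + 1 background class.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        # Sibling layer 2: 4 refined bounding-box values per object class.
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)

    def forward(self, feature_map, rois):
        # feature_map: (B, C, H, W) conv features of the whole image(s);
        # rois: (R, 5) rows of (batch_index, x1, y1, x2, y2) in image coords.
        pooled = roi_pool(feature_map, rois, output_size=self.roi_size,
                          spatial_scale=1.0 / 16)  # assumed backbone stride
        x = self.fcs(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```

A softmax over `cls_score` gives the per-class probabilities; during training both outputs feed the multi-task loss described below.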

The RoI pooling layer

RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling.

The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level. We use the pooling sub-window calculation given in SPPnet.
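
A minimal NumPy sketch of this single-level pooling, for one channel and one RoI; the floor/ceil boundary quantization follows the SPPnet-style sub-window calculation, and the helper name is made up:

```python
import numpy as np

def roi_max_pool(channel, roi, H=7, W=7):
    """Max-pool an h x w RoI window of one feature-map channel to a fixed H x W grid."""
    x1, y1, x2, y2 = roi                    # RoI in feature-map coordinates
    window = channel[y1:y2, x1:x2]          # the h x w RoI window
    h, w = window.shape
    out = np.empty((H, W), dtype=channel.dtype)
    for i in range(H):
        for j in range(W):
            # Sub-window of approximate size (h/H) x (w/W):
            # floor for the start boundary, ceil for the end boundary.
            r0 = int(np.floor(i * h / H)); r1 = int(np.ceil((i + 1) * h / H))
            c0 = int(np.floor(j * w / W)); c1 = int(np.ceil((j + 1) * w / W))
            out[i, j] = window[r0:r1, c0:c1].max()
    return out
```

With floor for the start and ceil for the end, every input cell falls in at least one sub-window, and adjacent sub-windows may overlap slightly.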

Initializing from pre-trained networks

First, the last max pooling layer is replaced by a RoI pooling layer.

Second, the network’s last fully connected layer and softmax are replaced with the two sibling layers described earlier.

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Fine-tuning for detection

Training all network weights with back-propagation is an important capability of Fast R-CNN.

Why is SPPnet unable to update weights below the spatial pyramid pooling layer?
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e., RoI) comes from a different image: each RoI may have a very large receptive field, often spanning the entire input image, so the forward pass must process that entire receptive field.

We propose a more efficient training method that takes advantage of feature sharing during training.

Mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes.
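
A sketch of this hierarchical sampling using the paper's settings (N = 2, R = 128, 25% foreground RoIs); the `dataset` structure and `is_fg` flag are illustrative assumptions:

```python
import random

def sample_minibatch(dataset, N=2, R=128, fg_fraction=0.25):
    """Sample N images, then R/N RoIs from each (paper: N=2, R=128, 25% foreground).

    `dataset` is assumed to map image id -> list of proposal dicts, each with an
    `is_fg` flag (foreground: IoU >= 0.5 with some ground-truth box).
    """
    rois_per_image = R // N
    batch = []
    for img in random.sample(list(dataset), N):
        fg = [r for r in dataset[img] if r["is_fg"]]
        bg = [r for r in dataset[img] if not r["is_fg"]]
        n_fg = min(int(rois_per_image * fg_fraction), len(fg))
        n_bg = min(rois_per_image - n_fg, len(bg))
        # All RoIs chosen from this image share one forward/backward pass
        # over its conv feature map, which is what makes training efficient.
        batch.append((img, random.sample(fg, n_fg) + random.sample(bg, n_bg)))
    return batch
```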

In addition to hierarchical sampling, Fast R-CNN jointly optimizes a softmax classifier and bounding-box regressors in a single fine-tuning stage, rather than training them separately.

Multi-task loss.
A Fast R-CNN network has two sibling output layers.

  1. The first outputs a discrete probability distribution (per RoI), $p = (p_0, \dots, p_K)$, over $K + 1$ categories.
  2. The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes.

We use a multi-task loss $L$ on each labeled RoI to jointly train for classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda [u \ge 1] L_{\text{loc}}(t^u, v)$$

$$L_{\text{cls}}(p, u) = -\log p_u$$

$$L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t_i^u - v_i)$$

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

Here $u$ is the true class of the RoI and $v$ is its ground-truth bounding-box regression target; the indicator $[u \ge 1]$ disables the localization loss for background RoIs ($u = 0$).

$\text{smooth}_{L_1}$ is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet.
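
Transcribed directly into PyTorch for a single RoI (a sketch; the paper sets $\lambda = 1$):

```python
import torch

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lambda [u >= 1] L_loc(t^u, v).

    p:   (K+1,) softmax probabilities for one RoI
    u:   true class index (0 is the catch-all background class)
    t_u: (4,) predicted offsets for class u;  v: (4,) regression targets
    """
    l_cls = -torch.log(p[u])            # L_cls(p, u) = -log p_u
    l_loc = smooth_l1(t_u - v).sum()    # sum over i in {x, y, w, h}
    return l_cls + lam * float(u >= 1) * l_loc
```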

Mini-batch sampling.

Back-propagation through RoI pooling layers.

Let $x_i \in \mathbb{R}$ be the $i$-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer's $j$-th output from the $r$-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r, j) = \arg\max_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r,j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r,j)] \frac{\partial L}{\partial y_{rj}}$$
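
Concretely, the forward pass records the winning index $i^*(r, j)$ for each output, and the backward pass routes each upstream gradient to that input, accumulating where outputs share a winner. A hand-written NumPy sketch, not any framework's actual API:

```python
import numpy as np

def roi_pool_backward(grad_y, argmax, num_inputs):
    """Computes dL/dx_i = sum_r sum_j [i == i*(r, j)] dL/dy_rj.

    grad_y: (num_rois, H*W) upstream gradients dL/dy_rj
    argmax: (num_rois, H*W) flat index i*(r, j) recorded in the forward pass
    """
    grad_x = np.zeros(num_inputs)
    for r in range(grad_y.shape[0]):
        for j in range(grad_y.shape[1]):
            # Only the input that won the max receives gradient; a single x_i
            # can win for several (r, j), so gradients accumulate with "+=".
            grad_x[argmax[r, j]] += grad_y[r, j]
    return grad_x
```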

SGD hyper-parameters.

Scale invariance

There are two ways of achieving scale-invariant object detection:

  1. via “brute force” learning and
  2. by using image pyramids.

Fast R-CNN detection

Truncated SVD for faster detection

A layer parameterized by the $u \times v$ weight matrix $W$ is approximately factorized as

$$W \approx U \Sigma_t V^\top$$

In this factorization, $U$ is a $u \times t$ matrix comprising the first $t$ left-singular vectors of $W$, $\Sigma_t$ is a $t \times t$ diagonal matrix containing the top $t$ singular values of $W$, and $V$ is a $v \times t$ matrix comprising the first $t$ right-singular vectors of $W$.

Truncated SVD reduces the parameter count from $uv$ to $t(u + v)$, which can be significant if $t$ is much smaller than $\min(u, v)$.

To compress a network, the single fully connected layer corresponding to $W$ is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^\top$ (and no biases) and the second uses $U$ (with the original biases associated with $W$).
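
A NumPy sketch of this factorization; the sizes below are kept small for illustration, but for VGG16's fc6 layer ($25088 \times 4096$) with the paper's $t = 1024$, the count drops from $uv \approx 102.8$M to $t(u+v) \approx 29.9$M parameters:

```python
import numpy as np

def truncated_svd_compress(W, t):
    """Split y = W x + b into y = U (Sigma_t V^T x) + b, keeping the top-t terms."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)  # W = U diag(S) V^T
    W1 = S[:t, None] * Vt[:t]   # first fc layer: Sigma_t V^T, no biases, shape (t, v)
    W2 = U[:, :t]               # second fc layer: U, keeps W's biases, shape (u, t)
    return W1, W2

u, v, t = 256, 1024, 64         # small illustrative sizes (fc6 would be far larger)
W = np.random.randn(u, v)
b = np.zeros(u)
W1, W2 = truncated_svd_compress(W, t)
x = np.random.randn(v)
y_full = W @ x + b
y_approx = W2 @ (W1 @ x) + b    # no non-linearity between the two layers
```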

Main results

Experimental setup

VOC 2010 and 2012 results

VOC 2007 results

Training and testing time

Truncated SVD
Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.

Which layers to fine-tune?

Training through the RoI pooling layer is important for very deep nets.

In the smaller networks we find that conv1 is generic and task-independent (a well-known fact).

Design evaluation

Does multi-task training help?

Multi-task training improves pure classification accuracy relative to training for classification alone.

Stage-wise training underperforms multi-task training.

Scale invariance: to brute force or finesse?

We compare brute-force learning (single scale) with image pyramids (multi-scale).

Single-scale detection performs almost as well as multi-scale detection.

Do we need more training data?

A good object detector should improve when supplied with more training data.

Do SVMs outperform softmax?

Softmax slightly outperforms SVM for all three networks. The effect is small, but it demonstrates that "one-shot" fine-tuning is sufficient compared to previous multi-stage training approaches.

Are more proposals always better?

Swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.

Average Recall (AR) does not correlate well with mAP as the number of proposals per image is varied.

Preliminary MS COCO results

Conclusion
