Fast R-CNN
Introduction
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges:
- First, numerous candidate object locations (often called “proposals”) must be processed.
- Second, these candidates provide only rough localization that must be refined to achieve precise localization.
We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
R-CNN and SPPnet
- Training is a multi-stage pipeline.
- Training is expensive in space and time.
- Object detection is slow.
Contributions
- Higher detection quality (mAP) than R-CNN and SPPnet
- Training is single-stage, using a multi-task loss
- Training can update all network layers
- No disk storage is required for feature caching
Fast R-CNN architecture and training
A Fast R-CNN network takes as input an entire image and a set of object proposals.
The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:
- one that produces softmax probability estimates over K object classes plus a catch-all “background” class
- another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes (see the sketch below).
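To make the output parameterization concrete, here is a minimal PyTorch-style sketch of the two sibling heads; the layer names (`cls_score`, `bbox_pred`) and the 4096-d fc feature size are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the two sibling output layers.
K, fc_dim = 20, 4096                         # e.g. K = 20 object classes for PASCAL VOC

cls_score = nn.Linear(fc_dim, K + 1)         # softmax logits over K classes + background
bbox_pred = nn.Linear(fc_dim, 4 * K)         # (tx, ty, tw, th) for each of the K classes

feat = torch.randn(1, fc_dim)                # fc feature vector for one RoI
probs = cls_score(feat).softmax(dim=1)       # p = (p0, ..., pK)
boxes = bbox_pred(feat).view(1, K, 4)        # boxes[:, k] are the four offsets for one object class
```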
The RoI pooling layer
RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling.
The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level. We use the pooling sub-window calculation from SPPnet.
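As an illustration, here is a naive NumPy sketch of RoI max pooling for a single RoI. The function name and the floor/ceil sub-window splitting are assumptions; the real layer also maps RoI coordinates from image space to feature-map space and handles its own rounding details.

```python
import numpy as np

def roi_max_pool(feat, roi, H=7, W=7):
    """Naive RoI max pooling over one feature map of shape (C, h_feat, w_feat).

    `roi` = (x1, y1, x2, y2) is given directly in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1 + 1, x2 - x1 + 1                    # RoI window is h x w
    out = np.empty((feat.shape[0], H, W), dtype=feat.dtype)
    for i in range(H):                                 # H x W grid of sub-windows,
        for j in range(W):                             # each of approximate size h/H x w/W
            ys = y1 + int(np.floor(i * h / H))
            ye = y1 + int(np.ceil((i + 1) * h / H))
            xs = x1 + int(np.floor(j * w / W))
            xe = x1 + int(np.ceil((j + 1) * w / W))
            out[:, i, j] = feat[:, ys:ye, xs:xe].max(axis=(1, 2))  # max pool per channel
    return out

# Example: pool a 512-channel feature map over a 23 x 17 RoI into a 7 x 7 grid.
pooled = roi_max_pool(np.random.rand(512, 40, 60), roi=(10, 5, 26, 27))
```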
Initializing from pre-trained networks
First, the last max pooling layer is replaced by a RoI pooling layer.
Second, the network’s last fully connected layer and softmax are replaced with the two sibling layers described earlier.
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
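A hedged sketch of these three modifications, using a torchvision VGG16 as a stand-in for the pre-trained network (the original models are Caffe-based) and re-declaring the sibling output layers from the earlier sketch:

```python
import torch.nn as nn
import torchvision
from torchvision.ops import RoIPool

vgg = torchvision.models.vgg16()
conv_layers = nn.Sequential(*list(vgg.features)[:-1])          # 1) drop the last max pooling layer
roi_pool = RoIPool(output_size=(7, 7), spatial_scale=1.0 / 16)  #    ...and pool RoIs instead (VGG16 stride 16)
fc_layers = nn.Sequential(*list(vgg.classifier)[:-1])          # keep fc6/fc7, drop the 1000-way classifier
cls_score = nn.Linear(4096, 21)                                 # 2) sibling layers (VOC: K = 20 + background)
bbox_pred = nn.Linear(4096, 4 * 20)

def fast_rcnn_forward(images, rois):                            # 3) two inputs: images and their RoIs
    feats = conv_layers(images)                                 # shared conv feature map
    pooled = roi_pool(feats, rois)                              # rois: list of (x1, y1, x2, y2) tensors, one per image
    x = fc_layers(pooled.flatten(start_dim=1))
    return cls_score(x), bbox_pred(x)
```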
Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN.
Let us first consider why SPPnet is unable to update weights below the spatial pyramid pooling layer.
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e., RoI) comes from a different image: each RoI may have a very large receptive field, often spanning the entire image, and the forward pass must process that entire receptive field.
We propose a more efficient training method that takes advantage of feature sharing during training.
In Fast R-CNN training, mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs are sampled from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes.
In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors.
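Below is a minimal sketch of the hierarchical sampling scheme. It assumes a `dataset` object that maps an image index to its labeled proposals (dicts with a `label` field, 0 = background); N = 2, R = 128, and the 25% foreground fraction follow the paper's mini-batch settings, while the helper name and data layout are illustrative assumptions.

```python
import random

def sample_minibatch(dataset, N=2, R=128, fg_fraction=0.25):
    """Hierarchical sampling sketch: pick N images, then R/N RoIs from each."""
    image_ids = random.sample(range(len(dataset)), N)
    batch = []
    for img_id in image_ids:
        rois = dataset[img_id]
        fg = [r for r in rois if r["label"] > 0]          # IoU >= 0.5 with a ground-truth box
        bg = [r for r in rois if r["label"] == 0]
        n_fg = min(int(fg_fraction * R / N), len(fg))
        n_bg = min(R // N - n_fg, len(bg))
        chosen = random.sample(fg, n_fg) + random.sample(bg, n_bg)
        batch.append((img_id, chosen))                    # RoIs from the same image share the forward pass
    return batch
```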
Multi-task loss.
A Fast R-CNN network has two sibling output layers.
- The first outputs a discrete probability distribution (per RoI), $p = (p_0, \dots, p_K)$, over $K+1$ categories.
- The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes, indexed by $k$.
We use a multi-task loss $L$ on each labeled RoI to jointly train for classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda [u \ge 1] L_{\text{loc}}(t^u, v)$$

in which $u$ is the ground-truth class of the RoI, $v$ is the ground-truth bounding-box regression target, and the Iverson bracket $[u \ge 1]$ equals 1 when $u \ge 1$ and 0 otherwise, so the localization loss is ignored for background RoIs ($u = 0$).
The classification loss is the log loss for the true class $u$:

$$L_{\text{cls}}(p, u) = -\log p_u$$
The localization loss is defined over the predicted bounding-box offsets $t^u$ for the true class $u$ and the ground-truth targets $v$:

$$L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t^u_i - v_i),$$

in which
$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet.
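The loss above can be sketched for a single RoI as follows (PyTorch). It assumes `p_logits` are the K+1 unnormalized class scores, `u` is a 0-d long tensor holding the ground-truth class, `t` is the (K, 4) matrix of predicted offsets with class k stored at row k-1, and `v` is the 4-d regression target; these conventions are illustrative assumptions. λ = 1 as in the paper.

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multi_task_loss(p_logits, u, t, v, lam=1.0):
    """L(p, u, t^u, v) for a single RoI."""
    l_cls = F.cross_entropy(p_logits.unsqueeze(0), u.view(1))  # -log p_u
    if u.item() >= 1:                                          # [u >= 1]: background gets no loc loss
        l_loc = smooth_l1(t[u - 1] - v).sum()                  # sum over i in {x, y, w, h}
    else:
        l_loc = torch.tensor(0.0)
    return l_cls + lam * l_loc
```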
Mini-batch sampling.
Back-propagation through RoI pooling layers.
Let $x_i \in \mathbb{R}$ be the $i$-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer's $j$-th output from the $r$-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r,j) = \arg\max_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r,j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.
The RoI pooling layer's backward function therefore computes the partial derivative of the loss with respect to each input $x_i$ as:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r,j)] \frac{\partial L}{\partial y_{rj}}$$
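In code, this backward rule simply scatters each output gradient back to the input activation that won the max. A NumPy sketch, assuming the forward pass cached the arg-max indices $i^*(r,j)$ as a flat-index array (the function name is an assumption):

```python
import numpy as np

def roi_pool_backward(grad_y, argmax, input_shape):
    """Scatter dL/dy_rj back to the inputs, per the equation above.

    `grad_y[r, j]` is dL/dy_rj and `argmax[r, j]` is i*(r, j), the flat index of
    the input that won the max over sub-window R(r, j).  A single input can win
    in several RoIs, so its gradients are accumulated.
    """
    grad_x = np.zeros(np.prod(input_shape))
    np.add.at(grad_x, argmax.ravel(), grad_y.ravel())   # sum over r, j with i = i*(r, j)
    return grad_x.reshape(input_shape)
```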
SGD hyper-parameters.
Scale invariance
We explore two ways of achieving scale-invariant object detection:
- via “brute force” learning and
- by using image pyramids.
Fast R-CNN detection
Truncated SVD for faster detection
A layer parameterized by the $u \times v$ weight matrix $W$ is approximately factorized as

$$W \approx U \Sigma_t V^\top$$

using the SVD, in which $U$ is a $u \times t$ matrix comprising the first $t$ left-singular vectors of $W$, $\Sigma_t$ is a $t \times t$ diagonal matrix containing the top $t$ singular values of $W$, and $V$ is a $v \times t$ matrix comprising the first $t$ right-singular vectors of $W$.
Truncated SVD reduces the parameter count from $uv$ to $t(u+v)$, which can be significant if $t$ is much smaller than $\min(u, v)$.
To compress a network, the single fully connected layer corresponding to $W$ is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^\top$ (and no biases) and the second uses $U$ (with the original biases associated with $W$).
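A small NumPy sketch of this compression for one fc layer; the helper name and toy sizes are illustrative assumptions (the paper applies the idea to the large fc6 and fc7 layers).

```python
import numpy as np

def truncated_svd_fc(W, b, t):
    """Split one fc layer y = W x + b into two layers using a rank-t SVD of W.

    Returns the weights of the two replacement layers: the first applies
    Sigma_t V^T (no bias), the second applies U (with the original bias b),
    dropping the parameter count from u*v to t*(u + v).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = S[:t, None] * Vt[:t]        # Sigma_t V^T, shape (t, v)
    W2 = U[:, :t]                    # U, shape (u, t)
    return W1, W2, b

# Usage: W2 @ (W1 @ x) + b approximates W @ x + b.
W = np.random.randn(256, 1024)
W1, W2, b = truncated_svd_fc(W, np.zeros(256), t=64)
```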
Main results
Experimental setup
VOC 2010 and 2012 results
VOC 2007 results
Training and testing time
Truncated SVD
Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.
Which layers to fine-tune?
Training through the RoI pooling layer is important for very deep nets.
In the smaller networks we find that conv1 is generic and task-independent (a well-known fact).
Design evaluation
Does multi-task training help?
Multi-task training improves pure classification accuracy relative to training for classification alone.
Stage-wise training underperforms multi-task training.
Scale invariance: to brute force or finesse?
We compare brute-force learning (single scale) with image pyramids (multi-scale).
Single-scale detection performs almost as well as multi-scale detection.
Do we need more training data?
A good object detector should improve when supplied with more training data.
Do SVMs outperform softmax?
Softmax slightly outperforms SVM for all three networks. The effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches.
Are more proposals always better?
Swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.
Average recall (AR) does not correlate well with mAP as the number of proposals per image is varied.