Fast R-CNN
Introduction
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges:
- First, numerous candidate object locations (often called “proposals”) must be processed.
- Second, these candidates provide only rough localization that must be refined to achieve precise localization.
We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
R-CNN and SPPnet
- Training is a multi-stage pipeline.
- Training is expensive in space and time.
- Object detection is slow.
Contributions
- Higher detection quality (mAP) than R-CNN and SPPnet
- Training is single-stage, using a multi-task loss
- Training can update all network layers
- No disk storage is required for feature caching
Fast R-CNN architecture and training
A Fast R-CNN network takes as input an entire image and a set of object proposals.
The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:
- one that produces softmax probability estimates over K object classes plus a catch-all “background” class
- another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes (see the sketch below).
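To make the output parameterization concrete, here is a minimal PyTorch-style sketch of the two sibling heads; the layer names (`cls_score`, `bbox_pred`) and the 4096-d fc feature size are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the two sibling output layers.
K, fc_dim = 20, 4096                         # e.g. K = 20 object classes for PASCAL VOC

cls_score = nn.Linear(fc_dim, K + 1)         # softmax logits over K classes + background
bbox_pred = nn.Linear(fc_dim, 4 * K)         # (tx, ty, tw, th) for each of the K classes

feat = torch.randn(1, fc_dim)                # fc feature vector for one RoI
probs = cls_score(feat).softmax(dim=1)       # p = (p0, ..., pK)
boxes = bbox_pred(feat).view(1, K, 4)        # boxes[:, k] are the four offsets for one object class
```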
The RoI pooling layer
RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling.
The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level. We use the pooling sub-window calculation from SPPnet.
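As an illustration, here is a naive NumPy sketch of RoI max pooling for a single RoI. The function name and the floor/ceil sub-window splitting are assumptions; the real layer also maps RoI coordinates from image space to feature-map space and handles its own rounding details.

```python
import numpy as np

def roi_max_pool(feat, roi, H=7, W=7):
    """Naive RoI max pooling over one feature map of shape (C, h_feat, w_feat).

    `roi` = (x1, y1, x2, y2) is given directly in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1 + 1, x2 - x1 + 1                    # RoI window is h x w
    out = np.empty((feat.shape[0], H, W), dtype=feat.dtype)
    for i in range(H):                                 # H x W grid of sub-windows,
        for j in range(W):                             # each of approximate size h/H x w/W
            ys = y1 + int(np.floor(i * h / H))
            ye = y1 + int(np.ceil((i + 1) * h / H))
            xs = x1 + int(np.floor(j * w / W))
            xe = x1 + int(np.ceil((j + 1) * w / W))
            out[:, i, j] = feat[:, ys:ye, xs:xe].max(axis=(1, 2))  # max pool per channel
    return out

# Example: pool a 512-channel feature map over a 23 x 17 RoI into a 7 x 7 grid.
pooled = roi_max_pool(np.random.rand(512, 40, 60), roi=(10, 5, 26, 27))
```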
Initializing from pre-trained networks
First, the last max pooling layer is replaced by a RoI pooling layer.
Second, the network’s last fully connected layer and softmax are replaced with the two sibling layers described earlier.
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
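A hedged sketch of these three modifications, using a torchvision VGG16 as a stand-in for the pre-trained network (the original models are Caffe-based) and re-declaring the sibling output layers from the earlier sketch:

```python
import torch.nn as nn
import torchvision
from torchvision.ops import RoIPool

vgg = torchvision.models.vgg16()
conv_layers = nn.Sequential(*list(vgg.features)[:-1])          # 1) drop the last max pooling layer
roi_pool = RoIPool(output_size=(7, 7), spatial_scale=1.0 / 16)  #    ...and pool RoIs instead (VGG16 stride 16)
fc_layers = nn.Sequential(*list(vgg.classifier)[:-1])          # keep fc6/fc7, drop the 1000-way classifier
cls_score = nn.Linear(4096, 21)                                 # 2) sibling layers (VOC: K = 20 + background)
bbox_pred = nn.Linear(4096, 4 * 20)

def fast_rcnn_forward(images, rois):                            # 3) two inputs: images and their RoIs
    feats = conv_layers(images)                                 # shared conv feature map
    pooled = roi_pool(feats, rois)                              # rois: list of (x1, y1, x2, y2) tensors, one per image
    x = fc_layers(pooled.flatten(start_dim=1))
    return cls_score(x), bbox_pred(x)
```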
Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN.
Let us first consider why SPPnet is unable to update weights below the spatial pyramid pooling layer.
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e., RoI) comes from a different image: each RoI may have a very large receptive field, often spanning the entire image, and the forward pass must process that entire receptive field.
We propose a more efficient training method that takes advantage of feature sharing during training.
In Fast R-CNN training, mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs are sampled from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes.
In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors.
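Below is a minimal sketch of the hierarchical sampling scheme. It assumes a `dataset` object that maps an image index to its labeled proposals (dicts with a `label` field, 0 = background); N = 2, R = 128, and the 25% foreground fraction follow the paper's mini-batch settings, while the helper name and data layout are illustrative assumptions.

```python
import random

def sample_minibatch(dataset, N=2, R=128, fg_fraction=0.25):
    """Hierarchical sampling sketch: pick N images, then R/N RoIs from each."""
    image_ids = random.sample(range(len(dataset)), N)
    batch = []
    for img_id in image_ids:
        rois = dataset[img_id]
        fg = [r for r in rois if r["label"] > 0]          # IoU >= 0.5 with a ground-truth box
        bg = [r for r in rois if r["label"] == 0]
        n_fg = min(int(fg_fraction * R / N), len(fg))
        n_bg = min(R // N - n_fg, len(bg))
        chosen = random.sample(fg, n_fg) + random.sample(bg, n_bg)
        batch.append((img_id, chosen))                    # RoIs from the same image share the forward pass
    return batch
```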
Multi-task loss.
A Fast R-CNN network has two sibling output layers.
- The first outputs a discrete probability distribution (per RoI), $p = (p_0, \dots, p_K)$, over $K+1$ categories.
- The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes, indexed by $k$.
We use a multi-task loss $L$ on each labeled RoI to jointly train for classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda [u \ge 1] L_{\text{loc}}(t^u, v)$$

in which $u$ is the ground-truth class of the RoI, $v$ is the ground-truth bounding-box regression target, and the Iverson bracket $[u \ge 1]$ equals 1 when $u \ge 1$ and 0 otherwise, so the localization loss is ignored for background RoIs ($u = 0$).
The classification loss is the log loss for the true class $u$:

$$L_{\text{cls}}(p, u) = -\log p_u$$
The localization loss is defined over the predicted bounding-box offsets $t^u$ for the true class $u$ and the ground-truth targets $v$:

$$L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t^u_i - v_i),$$

in which
$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet.
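The loss above can be sketched for a single RoI as follows (PyTorch). It assumes `p_logits` are the K+1 unnormalized class scores, `u` is a 0-d long tensor holding the ground-truth class, `t` is the (K, 4) matrix of predicted offsets with class k stored at row k-1, and `v` is the 4-d regression target; these conventions are illustrative assumptions. λ = 1 as in the paper.

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multi_task_loss(p_logits, u, t, v, lam=1.0):
    """L(p, u, t^u, v) for a single RoI."""
    l_cls = F.cross_entropy(p_logits.unsqueeze(0), u.view(1))  # -log p_u
    if u.item() >= 1:                                          # [u >= 1]: background gets no loc loss
        l_loc = smooth_l1(t[u - 1] - v).sum()                  # sum over i in {x, y, w, h}
    else:
        l_loc = torch.tensor(0.0)
    return l_cls + lam * l_loc
```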
Mini-batch sampling.
Back-propagation through RoI pooling layers.
Let $x_i \in \mathbb{R}$ be the $i$-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer's $j$-th output from the $r$-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r,j) = \arg\max_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r,j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.
The RoI pooling layer's backward function therefore computes the partial derivative of the loss with respect to each input $x_i$ as:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r,j)] \frac{\partial L}{\partial y_{rj}}$$
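In code, this backward rule simply scatters each output gradient back to the input activation that won the max. A NumPy sketch, assuming the forward pass cached the arg-max indices $i^*(r,j)$ as a flat-index array (the function name is an assumption):

```python
import numpy as np

def roi_pool_backward(grad_y, argmax, input_shape):
    """Scatter dL/dy_rj back to the inputs, per the equation above.

    `grad_y[r, j]` is dL/dy_rj and `argmax[r, j]` is i*(r, j), the flat index of
    the input that won the max over sub-window R(r, j).  A single input can win
    in several RoIs, so its gradients are accumulated.
    """
    grad_x = np.zeros(np.prod(input_shape))
    np.add.at(grad_x, argmax.ravel(), grad_y.ravel())   # sum over r, j with i = i*(r, j)
    return grad_x.reshape(input_shape)
```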
SGD hyper-parameters.
Scale invariance
We explore two ways of achieving scale-invariant object detection:
- via “brute force” learning and
- by using image pyramids.
Fast R-CNN detection
Truncated SVD for faster detection
A layer parameterized by the $u \times v$ weight matrix $W$ is approximately factorized as

$$W \approx U \Sigma_t V^\top$$

using the SVD, in which $U$ is a $u \times t$ matrix comprising the first $t$ left-singular vectors of $W$, $\Sigma_t$ is a $t \times t$ diagonal matrix containing the top $t$ singular values of $W$, and $V$ is a $v \times t$ matrix comprising the first $t$ right-singular vectors of $W$.
Truncated SVD reduces the parameter count from $uv$ to $t(u+v)$, which can be significant if $t$ is much smaller than $\min(u, v)$.
To compress a network, the single fully connected layer corresponding to $W$ is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^\top$ (and no biases) and the second uses $U$ (with the original biases associated with $W$).
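A small NumPy sketch of this compression for one fc layer; the helper name and toy sizes are illustrative assumptions (the paper applies the idea to the large fc6 and fc7 layers).

```python
import numpy as np

def truncated_svd_fc(W, b, t):
    """Split one fc layer y = W x + b into two layers using a rank-t SVD of W.

    Returns the weights of the two replacement layers: the first applies
    Sigma_t V^T (no bias), the second applies U (with the original bias b),
    dropping the parameter count from u*v to t*(u + v).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = S[:t, None] * Vt[:t]        # Sigma_t V^T, shape (t, v)
    W2 = U[:, :t]                    # U, shape (u, t)
    return W1, W2, b

# Usage: W2 @ (W1 @ x) + b approximates W @ x + b.
W = np.random.randn(256, 1024)
W1, W2, b = truncated_svd_fc(W, np.zeros(256), t=64)
```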
Main results
Experimental setup
VOC 2010 and 2012 results
VOC 2007 results
Training and testing time
Truncated SVD
Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.
Which layers to fine-tune?
Training through the RoI pooling layer is important for very deep nets.
In the smaller networks we find that conv1 is generic and task-independent (a well-known fact).
Design evaluation
Does multi-task training help?
Multi-task training improves pure classification accuracy relative to training for classification alone.
Stage-wise training underperforms multi-task training.
Scale invariance: to brute force or finesse?
We compare brute-force learning (single scale) with image pyramids (multi-scale).
Single-scale detection performs almost as well as multi-scale detection.
Do we need more training data?
A good object detector should improve when supplied with more training data.
Do SVMs outperform softmax?
Softmax slightly outperforms SVM for all three networks. The effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches.
Are more proposals always better?
Swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.
Average recall (AR) does not correlate well with mAP as the number of proposals per image is varied.