Paper Reading Notes (11): Fast R-CNN

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate.

Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.

In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

  1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.

  2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.

  3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.

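As a rough illustration of the idea (not the paper's implementation), the fixed-length SPP feature for one proposal can be sketched in NumPy; the grid sizes and the rounding of bin boundaries here are assumptions for clarity:

```python
import numpy as np

def spp_pool(region, output_sizes=(1, 2)):
    """Max-pool one proposal's feature-map region into several fixed
    grid sizes and concatenate the results (spatial pyramid pooling).
    `region` has shape (channels, h, w); bin edges use rounded splits."""
    C, h, w = region.shape
    parts = []
    for n in output_sizes:
        # boundaries of an n x n grid laid over the region
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        out = np.empty((C, n, n))
        for i in range(n):
            for j in range(n):
                cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                 xs[j]:max(xs[j + 1], xs[j] + 1)]
                out[:, i, j] = cell.max(axis=(1, 2))
        parts.append(out.reshape(C, -1))
    # fixed-length vector regardless of the region's h x w
    return np.concatenate(parts, axis=1).reshape(-1)

vec = spp_pool(np.random.rand(4, 9, 13), output_sizes=(1, 2))
print(vec.shape)  # (4 * (1 + 4),) = (20,)
```

The point of the pyramid is that proposals of any size map to the same feature length, so one shared conv feature map can feed fixed-size fully connected layers.
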

SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:

  1. Higher detection quality (mAP) than R-CNN and SPPnet

  2. Training is single-stage, using a multi-task loss

  3. Training can update all network layers

  4. No disk storage is required for feature caching

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

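The two sibling output layers can be sketched with plain NumPy; the weights, the feature length D, and K = 20 here are hypothetical stand-ins (VOC has 20 classes, and VGG16's fc7 is 4096-dimensional), not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20                      # object classes; index K is "background"
D = 4096                    # fc feature length (e.g., VGG16's fc7)

# Hypothetical weights of the two sibling output layers.
W_cls = rng.standard_normal((D, K + 1)) * 0.01   # softmax head
W_bbox = rng.standard_normal((D, 4 * K)) * 0.01  # per-class box regressors

def heads(fc7):
    """Map one RoI's fc feature to (a) softmax probabilities over the
    K + 1 categories and (b) 4 box-refinement values per object class."""
    logits = fc7 @ W_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    deltas = (fc7 @ W_bbox).reshape(K, 4)  # one (dx, dy, dw, dh) per class
    return probs, deltas

probs, deltas = heads(rng.standard_normal(D))
print(probs.shape, deltas.shape)  # (21,) (20, 4)
```
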

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI.

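A minimal RoI pooling sketch, assuming integer RoI coordinates already projected onto the feature map (the real layer also handles the image-to-feature-map scaling):

```python
import numpy as np

def roi_pool(fmap, roi, H=7, W=7):
    """Max-pool the part of `fmap` (C, h, w) inside `roi` = (x1, y1,
    x2, y2) into a fixed H x W grid, one value per channel per bin."""
    x1, y1, x2, y2 = roi
    region = fmap[:, y1:y2, x1:x2]
    C, h, w = region.shape
    ys = np.linspace(0, h, H + 1).astype(int)
    xs = np.linspace(0, w, W + 1).astype(int)
    out = np.empty((C, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                     xs[j]:max(xs[j + 1], xs[j] + 1)].max(axis=(1, 2))
    return out

fmap = np.random.rand(8, 40, 60)          # shared conv feature map
small = roi_pool(fmap, (3, 5, 10, 12))    # a small RoI
large = roi_pool(fmap, (0, 0, 60, 40))    # a whole-image RoI
print(small.shape, large.shape)           # both (8, 7, 7)
```

Whatever the RoI's size, the output is always C × H × W, which is what lets arbitrary proposals feed the fixed-size fc layers.
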

When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

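The three transformations can be shown schematically on a toy layer list (this is purely illustrative network surgery; the names and the list representation are made up, not a real framework API):

```python
# Toy stand-in for the pre-trained net: an ordered list of layer names.
pretrained = ["conv1", "pool1", "conv2", "pool2", "conv3", "pool5",
              "fc6", "fc7", "fc8_imagenet1000", "softmax1000"]

K = 20  # object classes

def to_fast_rcnn(layers, H=7, W=7):
    """Apply the three transformations from the text: swap the last max
    pool for RoI pooling, replace the 1000-way ImageNet classifier with
    the two sibling heads, and take (images, RoIs) as the two inputs."""
    net = list(layers)
    # 1) last max pooling layer -> RoI pooling sized H x W
    last_pool = max(i for i, n in enumerate(net) if n.startswith("pool"))
    net[last_pool] = f"roi_pool_{H}x{W}"
    # 2) final fc + softmax -> sibling cls/bbox heads over K + 1 classes
    net[-2:] = [f"fc_cls_{K + 1}+softmax", f"fc_bbox_{4 * K}"]
    # 3) the network now takes two data inputs
    return {"inputs": ["images", "rois"], "layers": net}

net = to_fast_rcnn(pretrained)
print(net["layers"][-3:])  # ['fc7', 'fc_cls_21+softmax', 'fc_bbox_80']
```
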

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

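The hierarchical sampling scheme is easy to sketch; the proposal table below is a hypothetical stand-in for real selective-search proposals:

```python
import random

def sample_minibatch(proposals, N=2, R=128):
    """Hierarchical sampling: draw N images, then R // N RoIs from
    each, so all RoIs in the batch share just N conv forward passes
    (vs. R forward passes for the one-RoI-per-image strategy)."""
    images = random.sample(list(proposals), N)
    batch = []
    for img in images:
        for roi in random.sample(proposals[img], R // N):
            batch.append((img, roi))
    return batch

# Hypothetical proposal table: image id -> list of candidate boxes.
proposals = {f"img{i}": [(j, j, j + 16, j + 16) for j in range(300)]
             for i in range(5)}
batch = sample_minibatch(proposals, N=2, R=128)
print(len(batch), len({img for img, _ in batch}))  # 128 2
```

The 64× figure in the text follows directly: 128 RoIs drawn from 2 images need 2 image-level forward passes instead of 128.
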

One concern over this strategy is that it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue, and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

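A minimal sketch of the joint loss for one RoI, assuming the form from the paper (log loss on the true class u, plus a smooth-L1 term on the box targets for non-background RoIs, weighted by λ); the numbers below are made up:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 x^2 if |x| < 1, else |x| - 0.5 (the robust
    regression loss used for the bounding-box targets)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x * x, ax - 0.5)

def multitask_loss(probs, u, t, v, lam=1.0):
    """Joint loss for one RoI: log loss on the true class u plus, for
    non-background RoIs (u >= 1), smooth-L1 between predicted box
    deltas t and regression targets v, weighted by lam."""
    l_cls = -np.log(probs[u])
    l_loc = smooth_l1(np.asarray(t) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc

probs = np.array([0.1, 0.7, 0.2])  # over K + 1 = 3 classes
loss = multitask_loss(probs, u=1, t=[0.2, 0.1, 0.0, 0.3],
                      v=[0.0, 0.0, 0.0, 0.0])
print(round(float(loss), 4))  # ~0.4267
```

Because both terms are minimized by the same SGD updates, the softmax classifier and the regressors are trained in the single fine-tuning stage rather than in three separate ones.
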
