目标检测经典论文翻译汇总:[翻译汇总]
翻译pdf文件下载:[下载地址]
此版为纯中文版,中英文对照版请稳步:[Fast R-CNN纯中文版]
Fast R-CNNRoss GirshickMicrosoft Research(微软研究院)rbg@microsoft.com |
Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https: //github.com/rbgirshick/fast-rcnn.
摘要
本文提出了一种快速的基于区域的卷积网络方法(fast R-CNN)用于目标检测。Fast R-CNN建立在以前使用的深卷积网络有效地分类目标的成果上。相比于之前的研究工作,Fast R-CNN采用了多项创新提高了训练和测试速度,同时也提高了检测准确度。Fast R-CNN训练非常深的VGG16网络比R-CNN快9倍,测试时快213倍,并在PASCAL VOC上得到了更高的准确度。与SPPnet相比,Fast R-CNN训练VGG16网络比他快3倍,测试速度快10倍,并且更准确。Fast R-CNN的Python和C ++(使用Caffe)实现以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn。
1. Introduction
Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.
1. 引言
最近,深度卷积网络[14, 16]已经显著提高了图像分类[14]和目标检测[9, 19]的准确性。与图像分类相比,目标检测是一个更具挑战性的任务,需要更复杂的方法来解决。由于这种复杂性,当前的方法(例如,[9, 11, 19, 25])采用多级pipelines的方式训练模型,既慢且精度不高。
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.
复杂性的产生是因为检测需要目标的精确定位,这就导致两个主要的难点。首先,必须处理大量候选目标位置(通常称为“proposals”)。 第二,这些候选框仅提供粗略定位,其必须被精细化以实现精确定位。 这些问题的解决方案经常会影响速度、准确性或简洁性。
In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
在本文中,我们简化了最先进的基于卷积网络的目标检测器的训练过程[9, 11]。我们提出一个单阶段训练算法,联合学习候选框分类和修正他们的空间位置。
The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
最终方法能够训练非常深的检测网络(例如VGG16),其网络比R-CNN快9倍,比SPPnet快3倍。在运行时,检测网络在PASCAL VOC 2012数据集上实现最高准确度,其中mAP为66%(R-CNN为62%),每张图像处理时间为0.3秒,不包括候选框的生成(注:所有的时间都是使用一个超频875MHz的Nvidia K40 GPU测试的)。
1.1. RCNN and SPPnet
The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).
1.1. R-CNN与SPPnet
基于区域的卷积网络方法(R-CNN)[9]通过使用深度卷积网络来分类目标候选框,获得了很高的目标检测精度。然而,R-CNN具有明显的缺点:
1. 训练过程是多级pipeline。R-CNN首先使用目标候选框对卷积神经网络使用log损失进行fine-tune。然后,它将卷积神经网络得到的特征送入SVM。这些SVM作为目标检测器,替代通过fine-tune学习的softmax分类器。在第三个训练阶段,学习bounding-box回归器。
2. 训练在时间和空间上是的开销很大。对于SVM和bounding-box回归训练,从每个图像中的每个目标候选框提取特征,并写入磁盘。对于VOC07 trainval上的5k个图像,使用如VGG16非常深的网络时,这个过程在单个GPU上需要2.5天。这些特征需要数百GB的存储空间。
3. 目标检测速度很慢。在测试时,从每个测试图像中的每个目标候选框提取特征。用VGG16网络检测目标时,每个图像需要47秒(在GPU上)。
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
R-CNN很慢是因为它为每个目标候选框进行卷积神经网络前向传递,而没有共享计算。SPPnet网络[11]提出通过共享计算加速R-CNN。SPPnet计算整个输入图像的卷积特征图,然后使用从共享特征图提取的特征向量来对每个候选框进行分类。通过最大池化将候选框内的特征图转化为固定大小的输出(例如6×6)来提取针对候选框的特征。多输出尺寸被池化,然后连接成空间金字塔池[15]。SPPnet在测试时将R-CNN加速10到100倍。由于更快的候选框特征提取,训练时间也减少了3倍。
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
SPP网络也有显著的缺点。像R-CNN一样,训练过程是一个多级pipeline,涉及提取特征、使用log损失对网络进行fine-tuning、训练SVM分类器以及最后拟合检测框回归。特征也要写入磁盘。但与R-CNN不同,在[11]中提出的fine-tuning算法不能更新在空间金字塔池之前的卷积层。不出所料,这种局限性(固定的卷积层)限制了深层网络的精度。
1.2. Contributions
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast RCNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN, SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching
1.2. 贡献
我们提出一种新的训练算法,修正了R-CNN和SPPnet的缺点,同时提高了速度和准确性。因为它能比较快地进行训练和测试,我们称之为Fast R-CNN。Fast RCNN方法有以下几个优点:
1. 比R-CNN和SPPnet具有更高的目标检测精度(mAP)。
2. 训练是使用多任务损失的单阶段训练。
3. 训练可以更新所有网络层参数。
4. 不需要磁盘空间缓存特征。
Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Fast R-CNN使用Python和C++(Caffe[13])编写,以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn。
2. Fast R-CNN architecture and training
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
2. Fast R-CNN架构与训练
Fast R-CNN的架构如图1所示。Fast R-CNN网络将整个图像和一组候选框作为输入。网络首先使用几个卷积层(conv)和最大池化层来处理整个图像,以产生卷积特征图。然后,对于每个候选框,RoI池化层从特征图中提取固定长度的特征向量。每个特征向量被送入一系列全连接(fc)层中,其最终分支成两个同级输出层 :一个输出K个类别加上1个包含所有背景类别的Softmax概率估计,另一个层输出K个类别的每一个类别输出四个实数值。每组4个值表示K个类别中一个类别的修正后检测框位置。
图1. Fast R-CNN架构。输入图像和多个感兴趣区域(RoI)被输入到全卷积网络中。每个RoI被池化到固定大小的特征图中,然后通过全连接层(FC)映射到特征向量。网络对于每个RoI具有两个输出向量:Softmax概率和每类bounding-box回归偏移量。该架构是使用多任务损失进行端到端训练的。
2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
2.1. RoI池化层
RoI池化层使用最大池化将任何有效的RoI内的特征转换成具有H×W(例如,7×7)的固定空间范围的小特征图,其中H和W是层的超参数,独立于任何特定的RoI。在本文中,RoI是卷积特征图中的一个矩形窗口。每个RoI由指定其左上角(r,c)及其高度和宽度(h,w)的四元组(r,c,h,w)定义。
RoI max pooling works by dividing the h×w RoI window into an H ×W grid of sub-windows of approximate size h/H×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
RoI最大池化通过将大小为h×w的RoI窗口分割成H×W个网格,子窗口大小约为h/H×w/W,然后对每个子窗口执行最大池化,并将输出合并到相应的输出网格单元中。同标准的最大池化一样,池化操作独立应用于每个特征图通道。RoI层只是SPPnets[11]中使用的空间金字塔池层的特例,其只有一个金字塔层。我们使用[11]中给出的池化子窗口计算方法。
2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
2.2 从预训练网络初始化
我们实验了三个预训练的ImageNet [4]网络,每个网络有五个最大池化层和5至13个卷积层(网络详细信息见4.1节