目标检测经典论文——Fast R-CNN论文翻译(中英文对照版):Fast R-CNN(Ross Girshick, Microsoft Research(微软研究院))

目标检测经典论文翻译汇总:[翻译汇总]

翻译pdf文件下载:[下载地址]

此版为中英文对照版,纯中文版请移步:[Fast R-CNN纯中文版]

Fast R-CNN

Ross Girshick

Microsoft Research(微软研究院)

rbg@microsoft.com

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

摘要

本文提出了一种快速的基于区域的卷积网络方法(Fast R-CNN)用于目标检测。Fast R-CNN建立在先前工作的基础上,利用深度卷积网络来有效地分类目标候选框。与之前的工作相比,Fast R-CNN采用了多项创新来提高训练和测试速度,同时也提高了检测准确度。Fast R-CNN训练非常深的VGG16网络比R-CNN快9倍,测试时快213倍,并在PASCAL VOC 2012上得到了更高的mAP。与SPPnet相比,Fast R-CNN训练VGG16网络比它快3倍,测试速度快10倍,并且更准确。Fast R-CNN的Python和C++(使用Caffe)实现以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn。

1. Introduction

Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

1. 引言

最近,深度卷积网络[14, 16]已经显著提高了图像分类[14]和目标检测[9, 19]的准确性。与图像分类相比,目标检测是一个更具挑战性的任务,需要更复杂的方法来解决。由于这种复杂性,当前的方法(例如,[9, 11, 19, 25])采用多级pipeline的方式训练模型,既慢又不够简洁。

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.

复杂性的产生是因为检测需要目标的精确定位,这就导致两个主要的难点。首先,必须处理大量候选目标位置(通常称为“proposals”)。 第二,这些候选框仅提供粗略定位,其必须被精细化以实现精确定位。 这些问题的解决方案经常会影响速度、准确性或简洁性。

In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

在本文中,我们简化了最先进的基于卷积网络的目标检测器的训练过程[9, 11]。我们提出一个单阶段训练算法,联合学习候选框分类和修正他们的空间位置。

The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).

最终方法能够训练非常深的检测网络(VGG16[20]),训练速度比R-CNN[9]快9倍,比SPPnet[11]快3倍。在运行时,检测网络处理每张图像需要0.3秒(不包括候选框生成的时间),同时在PASCAL VOC 2012[7]上达到最高准确度,mAP为66%(R-CNN为62%)。(注:所有的时间都是使用一个超频875MHz的Nvidia K40 GPU测试的。)

1.1. R-CNN and SPPnet

The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.

2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.

3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

1.1. R-CNN与SPPnet

基于区域的卷积网络方法(R-CNN)[9]通过使用深度卷积网络来分类目标候选框,获得了很高的目标检测精度。然而,R-CNN具有明显的缺点:

1. 训练过程是多级pipeline。R-CNN首先使用目标候选框,以log损失对卷积神经网络进行fine-tune。然后,它将卷积神经网络得到的特征送入SVM进行训练。这些SVM作为目标检测器,替代通过fine-tune学习到的softmax分类器。在第三个训练阶段,学习bounding-box回归器。

2. 训练在时间和空间上的开销很大。对于SVM和bounding-box回归的训练,需要从每个图像中的每个目标候选框提取特征,并写入磁盘。对于VOC07 trainval上的5k个图像,使用如VGG16这样非常深的网络时,这个过程需要2.5个GPU·天。这些特征需要数百GB的存储空间。

3. 目标检测速度很慢。在测试时,从每个测试图像中的每个目标候选框提取特征。用VGG16网络检测目标时,每个图像需要47秒(在GPU上)。

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.

R-CNN很慢是因为它为每个目标候选框进行一次卷积神经网络前向传播,而没有共享计算。SPPnet[11]提出通过共享计算来加速R-CNN。SPPnet计算整个输入图像的卷积特征图,然后使用从共享特征图提取的特征向量来对每个候选框进行分类。针对一个候选框的特征是这样提取的:将候选框内的那部分特征图最大池化为固定大小的输出(例如6×6)。如空间金字塔池化[15]中那样,对多种输出尺寸分别进行池化,然后将结果连接起来。SPPnet在测试时将R-CNN加速10到100倍。由于更快的候选框特征提取,训练时间也减少了3倍。

SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

SPP网络也有显著的缺点。像R-CNN一样,训练过程是一个多级pipeline,涉及提取特征、使用log损失对网络进行fine-tuning、训练SVM分类器以及最后拟合检测框回归。特征也要写入磁盘。但与R-CNN不同,在[11]中提出的fine-tuning算法不能更新在空间金字塔池之前的卷积层。不出所料,这种局限性(固定的卷积层)限制了深层网络的精度。

1.2. Contributions

We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it's comparatively fast to train and test. The Fast R-CNN method has several advantages:

1. Higher detection quality (mAP) than R-CNN, SPPnet

2. Training is single-stage, using a multi-task loss

3. Training can update all network layers

4. No disk storage is required for feature caching

1.2. 贡献

我们提出一种新的训练算法,修正了R-CNN和SPPnet的缺点,同时提高了速度和准确性。因为它能比较快地进行训练和测试,我们称之为Fast R-CNN。Fast R-CNN方法有以下几个优点:

1. 比R-CNN和SPPnet具有更高的目标检测精度(mAP)。

2. 训练是使用多任务损失的单阶段训练。

3. 训练可以更新所有网络层参数。

4. 不需要磁盘空间缓存特征。

Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

Fast R-CNN使用Python和C++(Caffe[13])编写,以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn

2. Fast R-CNN architecture and training

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.

2. Fast R-CNN架构与训练

Fast R-CNN的架构如图1所示。Fast R-CNN网络将整个图像和一组候选框作为输入。网络首先使用几个卷积层(conv)和最大池化层来处理整个图像,以产生卷积特征图。然后,对于每个候选框,RoI池化层从特征图中提取固定长度的特征向量。每个特征向量被送入一系列全连接(fc)层中,其最终分支成两个同级输出层:一个输出K个目标类别加上1个"背景"类别的Softmax概率估计,另一个为K个目标类别中的每一个类别输出四个实数值。每组4个值编码了K个类别中某一个类别的修正后检测框位置。

图1. Fast R-CNN架构。输入图像和多个感兴趣区域(RoI)被输入到全卷积网络中。每个RoI被池化到固定大小的特征图中,然后通过全连接层(FC)映射到特征向量。网络对于每个RoI具有两个输出向量:Softmax概率和每类bounding-box回归偏移量。该架构是使用多任务损失进行端到端训练的。

2.1. The RoI pooling layer

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

2.1. RoI池化层

RoI池化层使用最大池化将任何有效的RoI内的特征转换成具有H×W(例如,7×7)的固定空间范围的小特征图,其中H和W是层的超参数,独立于任何特定的RoI。在本文中,RoI是卷积特征图中的一个矩形窗口。每个RoI由指定其左上角(r,c)及其高度和宽度(h,w)的四元组(r,c,h,w)定义。

RoI max pooling works by dividing the h×w RoI window into an H ×W grid of sub-windows of approximate size h/H×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].

RoI最大池化通过将大小为h×w的RoI窗口分割成H×W个网格,子窗口大小约为h/H×w/W,然后对每个子窗口执行最大池化,并将输出合并到相应的输出网格单元中。同标准的最大池化一样,池化操作独立应用于每个特征图通道。RoI层只是SPPnets[11]中使用的空间金字塔池层的特例,其只有一个金字塔层。我们使用[11]中给出的池化子窗口计算方法。
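
(译者注:下面给出一个用NumPy写的RoI最大池化前向过程的示意草图,仅用于说明上述"子窗口划分+逐通道最大池化"的思路;其中的floor/ceil划分方式、函数名roi_max_pool以及示例中的特征图尺寸均为译者假设,并非论文的官方实现。)

import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    # feature_map: (C, h_map, w_map)的卷积特征图;roi: (r, c, h, w),左上角坐标加高宽
    # 假设RoI完全落在特征图内;返回(C, H, W)的固定大小输出
    C = feature_map.shape[0]
    r, c, h, w = roi
    output = np.empty((C, H, W), dtype=feature_map.dtype)
    for i in range(H):
        # 第i行子窗口的起止位置,大小约为h/H(此处用floor/ceil划分,是译者的一种近似)
        y0 = r + int(np.floor(i * h / H))
        y1 = r + int(np.ceil((i + 1) * h / H))
        for j in range(W):
            x0 = c + int(np.floor(j * w / W))
            x1 = c + int(np.ceil((j + 1) * w / W))
            # 对每个特征图通道独立地做最大池化
            output[:, i, j] = feature_map[:, y0:y1, x0:x1].max(axis=(1, 2))
    return output

# 用法示例:在512×38×50的特征图上对一个RoI做池化,得到512×7×7的输出
fmap = np.random.randn(512, 38, 50).astype(np.float32)
pooled = roi_max_pool(fmap, roi=(5, 8, 20, 30), H=7, W=7)
print(pooled.shape)  # (512, 7, 7)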

2.2. Initializing from pre-trained networks

We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

2.2. 从预训练网络初始化

我们实验了三个预训练的ImageNet [4]网络,每个网络有五个最大池化层和5至13个卷积层(网络详细信息见4.1节)。当预训练网络初始化Fast R-CNN网络时,其经历三个变换。

First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).

首先,最后的最大池化层由RoI池层代替,其将H和W设置为与网络的第一个全连接层兼容的配置(例如,对于VGG16,H=W=7)。

Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and category-specific bounding-box regressors).

其次,网络的最后一个全连接层和Softmax(其被训练用于1000类ImageNet分类)被替换为前面描述的两个同级层(全连接层和K+1个类别的Softmax以及特定类别的bounding-box回归)。

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

第三,网络被修改为采用两个数据输入:图像的列表和这些图像中的RoI的列表。

2.3. Fine-tuning for detection

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

2.3. 检测任务的fine-tune

用反向传播训练所有网络权重是Fast R-CNN的重要能力。首先,让我们阐明为什么SPPnet无法更新低于空间金字塔池化层的权重。

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

根本原因是当每个训练样本(即RoI)来自不同的图像时,通过SPP层的反向传播是非常低效的,这正是训练R-CNN和SPPnet网络的方法。低效是因为每个RoI可能具有非常大的感受野,通常跨越整个输入图像。由于正向传播必须处理整个感受野,训练输入很大(通常是整个图像)。

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

我们提出了一种更有效的训练方法,利用训练期间的特征共享。在Fast R-CNN网络训练中,随机梯度下降(SGD)的小批量是被分层采样的,首先采样N个图像,然后从每个图像采样R/N个RoI。关键的是,来自同一图像的RoI在前向和后向传播中共享计算和内存。减小N,就减少了小批量的计算。例如,当N=2和R=128时,得到的训练方案比从128幅不同的图采样一个RoI(即R-CNN和SPPnet的策略)快64倍。

One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

这个策略的一个令人担心的问题是它可能导致训练收敛变慢,因为来自相同图像的RoI是相关的。这个问题似乎在实际情况下并不存在,当N=2和R=128时,我们使用比R-CNN更少的SGD迭代就获得了良好的结果。
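
(译者注:下面是分层采样思路的一个最小Python示意:先采N张图像,再从每张图像采R/N个RoI。其中dataset的数据结构与函数名sample_hierarchical_minibatch均为译者为说明而假设,并非原文代码。)

import numpy as np

def sample_hierarchical_minibatch(dataset, N=2, R=128, rng=None):
    # dataset为列表,每个元素形如{'image': ..., 'rois': [...]},该数据结构为译者假设
    rng = rng or np.random.default_rng()
    rois_per_image = R // N                      # 每张图像采样R/N个RoI
    image_indices = rng.choice(len(dataset), size=N, replace=False)
    minibatch = []
    for idx in image_indices:
        entry = dataset[idx]
        rois = entry['rois']
        # 来自同一图像的RoI在前向/反向传播中共享同一次卷积计算与内存
        chosen = rng.choice(len(rois), size=min(rois_per_image, len(rois)), replace=False)
        minibatch.append((entry['image'], [rois[i] for i in chosen]))
    return minibatch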

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

除了分层采样,Fast R-CNN使用了一个精简的训练过程:在一个fine-tuning阶段联合优化Softmax分类器和bounding-box回归器,而不是在三个独立的阶段分别训练softmax分类器、SVM和回归器[9, 11]。下面将详细描述该过程的各个组成部分(损失、小批量采样策略、通过RoI池化层的反向传播和SGD超参数)。

Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p_0, ..., p_K), over K+1 categories. As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.

多任务损失。Fast R-CNN网络具有两个同级输出层。第一个输出在K+1个类别上的离散概率分布(每个RoI一个),p = (p_0, ..., p_K)。通常,p由全连接层的K+1个输出经过Softmax计算得到。第二个同级层输出bounding-box回归偏移,即t^k = (t^k_x, t^k_y, t^k_w, t^k_h),k为K个目标类别的索引。我们使用[9]中给出的方法对t^k进行参数化,其中t^k指定相对于候选框的尺度不变平移和对数空间的高度/宽度偏移。
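
(译者注:[9]中的参数化方式可以用下面的Python草图来说明:平移量相对候选框的宽高做了归一化(因而尺度不变),宽高变化取对数空间。函数名与坐标约定(中心坐标加宽高)为译者假设。)

import numpy as np

def bbox_transform(proposal, gt):
    # 计算[9]中的回归目标t = (tx, ty, tw, th);proposal与gt均为(cx, cy, w, h)
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw       # 平移量除以候选框宽度,具有尺度不变性
    ty = (gy - py) / ph
    tw = np.log(gw / pw)      # 对数空间的宽度变化
    th = np.log(gh / ph)      # 对数空间的高度变化
    return np.array([tx, ty, tw, th])

def bbox_transform_inv(proposal, t):
    # 把预测的偏移t应用到候选框上,得到修正后的检测框
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([px + tx * pw, py + ty * ph, pw * np.exp(tw), ph * np.exp(th)])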

Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v),        (1)

in which L_cls(p, u) = −log p_u is log loss for true class u.

每个训练的RoI用类别真值u和bounding-box回归目标真值v打上标签。我们对每个标注的RoI使用多任务损失L来联合训练分类和bounding-box回归:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v),        (1)

其中L_cls(p, u) = −log p_u是类别真值u的log损失。

The second task loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence L_loc is ignored. For bounding-box regression, we use the loss

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i),        (2)

in which

smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,        (3)

is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.

对于类别真值u,第二个任务损失L_loc定义在类别u的bounding-box回归目标真值元组v = (v_x, v_y, v_w, v_h)和预测元组t^u = (t^u_x, t^u_y, t^u_w, t^u_h)之上。Iverson括号指示函数[u ≥ 1]在u ≥ 1时取值为1,否则为0。按照惯例,包含所有背景的类别被标记为u = 0。对于背景RoI,不存在检测框真值的概念,因此L_loc被忽略。对于检测框回归,我们使用损失:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i),        (2)

其中:

smooth_L1(x) = 0.5x²(若|x| < 1),否则为|x| − 0.5,        (3)

它是鲁棒的L1损失,对异常值的敏感程度低于R-CNN和SPPnet中使用的L2损失。当回归目标无界时,使用L2损失进行训练可能需要仔细调整学习率以防止梯度爆炸。公式(3)消除了这种敏感性。

The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets v_i to have zero mean and unit variance. All experiments use λ = 1.

公式(1)中的超参数λ控制两个任务损失之间的平衡。我们将回归目标真值v_i归一化为零均值和单位方差。所有实验都使用λ = 1。
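
(译者注:下面用NumPy给出公式(1)至(3)的一个数值示意实现,只计算损失值、不涉及反向传播;函数名与接口为译者假设。)

import numpy as np

def smooth_l1(x):
    # 公式(3)的鲁棒L1损失
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    # p: 长度为K+1的Softmax概率;u: 类别真值(0为背景)
    # t_u: 类别u对应的预测偏移;v: 回归目标真值
    loss_cls = -np.log(p[u])                                          # L_cls = -log p_u
    if u >= 1:                                                        # Iverson括号[u >= 1]
        loss_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()   # 公式(2)
    else:
        loss_loc = 0.0                                                # 背景RoI忽略定位损失
    return loss_cls + lam * loss_loc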

We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).

我们注意到[6]使用相关的损失来训练一个类别无关的目标候选网络。与我们的方法不同,[6]倡导将定位和分类分离的双网络系统。OverFeat[19]、R-CNN[9]和SPPnet[11]也训练分类器和检测框定位器,但是这些方法使用分阶段训练,我们将证明这对于Fast R-CNN来说是次优的(见5.1节)。

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a groundtruth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u≥1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5], following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.

小批量采样。在fine-tune期间,每个SGD小批量由N=2个图像构成,均匀地随机选择(按照通常的做法,我们实际上是在数据集的随机排列上迭代)。我们使用大小为R=128的小批量,从每个图像采样64个RoI。如[9]中那样,我们从与检测框真值的交并比(IoU)至少为0.5的候选框中获取25%的RoI。这些RoI构成了用前景目标类别标记的样本,即u≥1。根据[11],剩余的RoI从与检测框真值的最大IoU落在区间[0.1, 0.5)的候选框中采样。这些是背景样本,并用u=0标记。0.1的阈值下限似乎起到了困难样本挖掘(hard example mining)[8]的启发式作用。在训练期间,图像以0.5的概率水平翻转。不使用其他数据增强。
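
(译者注:下面的Python草图示意了上述前景/背景RoI的划分与采样规则(前景IoU≥0.5,背景最大IoU落在[0.1, 0.5));其中框的坐标表示、函数名等均为译者为说明而设,不代表官方实现,且假设前景与背景候选均非空。)

import numpy as np

def iou(box_a, box_b):
    # 计算两个(x1, y1, x2, y2)框的交并比IoU
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    return inter / union if union > 0 else 0.0

def sample_rois(proposals, gt_boxes, num_rois=64, fg_fraction=0.25, rng=None):
    # proposals与gt_boxes均为(x1, y1, x2, y2)框的列表
    rng = rng or np.random.default_rng()
    max_iou = np.array([max(iou(p, g) for g in gt_boxes) for p in proposals])
    fg_idx = np.where(max_iou >= 0.5)[0]                      # 前景:IoU >= 0.5,标记为u >= 1
    bg_idx = np.where((max_iou >= 0.1) & (max_iou < 0.5))[0]  # 背景:最大IoU落在[0.1, 0.5),标记为u = 0
    num_fg = min(int(num_rois * fg_fraction), len(fg_idx))    # 约25%取自前景
    num_bg = min(num_rois - num_fg, len(bg_idx))
    keep_fg = rng.choice(fg_idx, size=num_fg, replace=False)
    keep_bg = rng.choice(bg_idx, size=num_bg, replace=False)
    labels = np.concatenate([np.ones(num_fg, dtype=int), np.zeros(num_bg, dtype=int)])
    return np.concatenate([keep_fg, keep_bg]), labels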

Back-propagation through RoI pooling layers. Backpropagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.

通过RoI池化层的反向传播。反向传播通过RoI池化层。为了清楚起见,我们假设每个小批量(N=1)只有一个图像,扩展到N>1是显而易见的,因为前向传播独立地处理所有图像。

Let x_i ∈ ℝ be the i-th activation input into the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r, j)}, in which i*(r, j) = argmax_{i' ∈ R(r, j)} x_{i'}. R(r, j) is the index set of inputs in the sub-window over which the output unit y_rj max pools. A single x_i may be assigned to several different outputs y_rj.

令x_i ∈ ℝ是RoI池化层的第i个激活输入,令y_rj是该层来自第r个RoI的第j个输出。RoI池化层计算y_rj = x_{i*(r, j)},其中i*(r, j) = argmax_{i' ∈ R(r, j)} x_{i'}。R(r, j)是输出单元y_rj进行最大池化的子窗口中输入的索引集合。单个x_i可以被分配给多个不同的输出y_rj。

The RoI pooling layer’s backwards function computes partial derivative of the loss function with respect to each input variable xi by following the argmax switches:

RoI池化层的反向传播函数通过遵循argmax开关来计算损失函数关于每个输入变量x_i的偏导数:

∂L/∂x_i = Σ_r Σ_j [i = i*(r, j)] ∂L/∂y_rj

In words, for each mini-batch RoI r and for each pooling output unit y_rj, the partial derivative ∂L/∂y_rj is accumulated if i is the argmax selected for y_rj by max pooling. In back-propagation, the partial derivatives ∂L/∂y_rj are already computed by the backwards function of the layer on top of the RoI pooling layer.

换句话说,对于每个小批量RoI r和每个池化输出单元y_rj,如果i是y_rj通过最大池化所选中的argmax,则将偏导数∂L/∂y_rj累加下来。在反向传播中,偏导数∂L/∂y_rj已经由RoI池化层上面一层的反向传播函数计算好了。
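
(译者注:下面的NumPy草图示意了按argmax开关把梯度路由回输入的过程,对应上式:只有被某个输出单元选中为最大值的输入位置才会累加梯度。argmax索引的存储方式为译者假设。)

import numpy as np

def roi_pool_backward(d_loss_d_y, argmax_index, input_shape):
    # d_loss_d_y: (num_rois, H, W)的上层梯度,即∂L/∂y_rj
    # argmax_index: 同形状的整型数组,给出前向传播时y_rj所选中输入x的展平索引i*(r, j)
    # 返回与输入同形状的∂L/∂x
    d_loss_d_x = np.zeros(int(np.prod(input_shape)), dtype=d_loss_d_y.dtype)
    # 只有当i是某个输出单元通过最大池化选中的argmax时,才把∂L/∂y_rj累加到∂L/∂x_i上
    np.add.at(d_loss_d_x, argmax_index.ravel(), d_loss_d_y.ravel())
    return d_loss_d_x.reshape(input_shape)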

SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.

SGD超参数。用于Softmax分类和检测框回归的全连接层的权重分别使用标准差为0.01和0.001的零均值高斯分布初始化。偏置初始化为0。所有层的权重学习率为1倍的全局学习率,偏置为2倍的全局学习率,全局学习率为0.001。当在VOC07或VOC12 trainval上训练时,我们进行30k次小批量SGD迭代,然后将学习率降低到0.0001,再训练10k次迭代。当我们在更大的数据集上训练时,我们运行SGD更多的迭代次数,如下文所述。使用0.9的动量和0.0005的参数衰减(作用于权重和偏置)。
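
(译者注:上述SGD超参数可以整理成如下的Python示意,仅用于汇总正文给出的数值(全局学习率与阶梯下降、动量、参数衰减,以及权重/偏置的每层学习率倍率),并非任何框架的实际配置文件。)

def learning_rate(iteration, base_lr=0.001, step=30000, gamma=0.1):
    # VOC07/12 trainval训练:前30k次迭代使用0.001,之后降为0.0001再训练10k次
    return base_lr * (gamma if iteration >= step else 1.0)

LR_MULT = {'weight': 1.0, 'bias': 2.0}   # 权重/偏置的每层学习率倍率(1倍与2倍全局学习率)
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005                    # 作用于权重和偏置的参数衰减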

2.4. Scale invariance

We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.

2.4. 尺度不变性

我们探索两种实现尺度不变目标检测的方法:(1)通过“brute force”学习和(2)通过使用图像金字塔。这些策略遵循[11]中的两种方法。在“brute force”方法中,在训练和测试期间以预定义的像素大小处理每个图像。网络必须直接从训练数据学习尺度不变性目标检测。

The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.

相反,多尺度方法通过图像金字塔向网络提供近似的尺度不变性。在测试时,图像金字塔用于对每个候选框进行近似的尺度归一化。按照[11]中的方法,作为数据增强的一种形式,在多尺度训练期间,我们在每次图像采样时随机采样一个金字塔尺度。由于GPU内存限制,我们只对较小的网络进行多尺度训练。

3. Fast R-CNN detection

Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224² pixels in area [11].

3. Fast R-CNN检测

一旦Fast R-CNN网络fine-tune完毕,检测几乎就只相当于运行一次前向传播(假设候选框是预先计算好的)。网络将图像(或图像金字塔,编码为图像列表)和待计算得分的R个候选框列表作为输入。在测试时,R通常在2000左右,尽管我们也会考虑R更大(约45k)的情况。当使用图像金字塔时,每个RoI被分配到一个尺度,使得缩放后的RoI的面积最接近224²个像素[11]。

For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k|r) ≜ pk. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].

对于每个测试的RoI r,前向传播输出类别后验概率分布p和相对于r的预测检测框偏移集合(K个类别中的每个类别都得到其自己的修正后检测框预测结果)。我们使用估计的概率Pr(class=k|r) ≜ p_k为每个目标类别k给r分配检测置信度。然后,我们使用R-CNN[9]中的算法和设置对每个类别独立执行非极大值抑制。
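
(译者注:下面给出单个类别非极大值抑制的一个NumPy示意实现;检测时对K个类别分别独立调用。其中IoU阈值0.3只是译者假设的常用取值,原文沿用的是R-CNN[9]的算法与设置。)

import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    # 对单个类别做非极大值抑制;boxes为(N, 4)的(x1, y1, x2, y2),scores为(N,)
    order = np.argsort(scores)[::-1]      # 按得分从高到低排序
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]   # 抑制与当前最高分框重叠过大的框
    return keep

# 检测时对K个类别分别独立调用nms(每个类别有自己的得分与修正后的检测框)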

3.1. Truncated SVD for faster detection

For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].

Figure 2. Timing for VGG16 before and after truncated SVD. Before SVD, fully connected layers fc6 and fc7 take 45% of the time.

3.1. 使用截断的SVD实现更快的检测

对整个图像进行分类任务时,与卷积层相比,计算全连接层花费的时间较少。相反,在检测任务中,要处理的RoI数量很大,并且接近一半的前向传播时间用于计算全连接层(参见图2)。较大的全连接层可以很容易地通过截断SVD[5, 23]进行压缩来提升速度。

图2. 截断SVD之前和之后VGG16的时间分布。在SVD之前,全连接层fc6和fc7消耗45%的时间。

In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as

W ≈ U Σ_t V^T

using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, Σ_t is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.

在这种技术中,由u × v权重矩阵W参数化的层通过SVD被近似分解为:

W ≈ U Σ_t V^T

在这种分解中,U是一个u × t的矩阵,由W的前t个左奇异向量组成;Σ_t是t × t的对角矩阵,包含W的前t个奇异值;V是v × t的矩阵,由W的前t个右奇异向量组成。截断SVD将参数量从uv减少到t(u + v),如果t远小于min(u, v),则压缩效果非常显著。为了压缩网络,对应于W的单个全连接层被两个全连接层替代,它们之间没有非线性。第一层使用权重矩阵Σ_t V^T(没有偏置),第二层使用U(带有与W关联的原始偏置)。当RoI的数量较大时,这种简单的压缩方法能实现很好的加速。
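
(译者注:下面的NumPy草图演示了如何用截断SVD把一个全连接层拆成两个全连接层:第一层权重为Σ_t·V^T(无偏置),第二层权重为U(带原偏置)。示例中的矩阵尺寸为了运行速度取得较小,fc6实际对应的是25088×4096的权重矩阵。)

import numpy as np

def compress_fc_with_truncated_svd(W, b, t):
    # 把全连接层y = Wx + b近似为两个全连接层(中间无非线性)
    # 第一层:权重Σ_t·V^T(无偏置);第二层:权重U(带原偏置b)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_t = U[:, :t]                       # 前t个左奇异向量,u×t
    sigma_vt = S[:t, None] * Vt[:t, :]   # Σ_t·V^T,t×v
    return (sigma_vt, np.zeros(t)), (U_t, b)

# 演示:参数量从u*v降到t*(u+v);此处为运行速度取小尺寸
u, v, t = 256, 1024, 64
W = np.random.randn(u, v) * 0.01
b = np.zeros(u)
(W1, b1), (W2, b2) = compress_fc_with_truncated_svd(W, b, t)
x = np.random.randn(v)
y_approx = W2 @ (W1 @ x + b1) + b2   # 近似原来的W @ x + b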

4. Main results

Three main results support this paper’s contributions:

1. State-of-the-art mAP on VOC07, 2010, and 2012

2. Fast training and testing compared to R-CNN, SPPnet

3. Fine-tuning conv layers in VGG16 improves mAP

4. 主要结果

三个主要结果支持本文的贡献:

1. VOC07,2010和2012的最高的mAP。

2. 相比R-CNN、SPPnet,训练和测试的速度更快。

3. 对VGG16卷积层Fine-tuning后提升了mAP。

4.1. Experimental setup

Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has the same depth as S, but is wider. We call this network model M, for “medium.” The final network is the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this section, all experiments use single-scale training and testing (s = 600; see Section 5.2 for details).

4.1. 实验设置

我们的实验使用了三个经过预训练的ImageNet网络模型,这些模型可以在线获得(脚注:https://github.com/BVLC/caffe/wiki/Model-Zoo)。第一个是来自R-CNN [9]的CaffeNet(实质上是AlexNet[14])。 我们将这个CaffeNet称为模型S,即小模型。第二网络是来自[3]的VGG_CNN_M_1024,其具有与S相同的深度,但是更宽。我们把这个网络模型称为M,即中等模型。最后一个网络是来自[20]的非常深的VGG16模型。由于这个模型是最大的,我们称之为L。在本节中,所有实验都使用单尺度训练和测试(s=600,详见5.2节)。

4.2. VOC 2010 and 2012 results

On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4 (outside data) track from the public leaderboard (Table 2, Table 3). For the NUS_NIN_c2000 and BabyLearning methods, there are no associated publications at this time and we could not find exact information on the ConvNet architectures used; they are variants of the Network-in-Network design [17]. All other methods are initialized from the same pre-trained VGG16 network.

Table 2. VOC 2010 test detection average precision (%). BabyLearning uses a network based on [17]. All other methods use VGG16. Training set key: 12: VOC12 trainval, Prop.: proprietary dataset, 12+seg: 12 with segmentation annotations, 07++12: union of VOC07 trainval, VOC07 test, and VOC12 trainval.

Table 3. VOC 2012 test detection average precision (%). BabyLearning and NUS NIN c2000 use networks based on [17]. All other methods use VGG16. Training set key: see Table 2, Unk.: unknown.

4.2. VOC 2010和2012数据集上的结果

(如上面表2,表3所示)在这些数据集上,我们比较Fast R-CNN(简称FRCN)和公共排行榜中comp4(外部数据)上的主流方法(脚注:http://host.robots.ox.ac.uk:8080/leaderboard)。对于NUS_NIN_c2000和BabyLearning方法,目前没有相关的出版物,我们无法找到有关所使用的ConvNet体系结构的确切信息;它们是Network-in-Network的变体[17]。所有其他方法都通过相同的预训练VGG16网络进行了初始化。

表2. VOC 2010测试检测平均精度(%)。BabyLearning使用基于[17]的网络。所有其他方法使用VGG16。训练集关键字:12代表VOC12 trainval,Prop.代表专有数据集,12+seg代表带有分割标注的VOC12 trainval,07++12代表VOC07 trainval、VOC07 test和VOC12 trainval的并集。

表3. VOC 2012测试检测平均精度(%)。BabyLearning和NUS_NIN_c2000使用基于[17]的网络。所有其他方法使用VGG16。训练设置:见表2,Unk.代表未知。

Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1] semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.

Fast R-CNN在VOC12上获得最高结果,mAP为65.7%(加上额外数据为68.4%)。它也比其他方法快两个数量级,这些方法都基于"慢速"的R-CNN流程。在VOC10上,SegDeepM[25]获得了比Fast R-CNN更高的mAP(67.2%对比66.1%)。SegDeepM使用VOC12 trainval训练集及分割标注进行训练,它旨在通过使用马尔可夫随机场对R-CNN的检测结果和来自O2P[1]语义分割方法的分割结果进行推理,从而提高R-CNN的精度。可以将SegDeepM中的R-CNN换成Fast R-CNN,这可能会带来更好的结果。当使用扩大的07++12训练集(见表2标题)时,Fast R-CNN的mAP增加到68.8%,超过了SegDeepM。

4.3. VOC 2007 results

On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). R-CNN achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult” in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use “difficult” examples.

4.3. VOC 2007数据集上的结果

在VOC07数据集上,我们比较Fast R-CNN与R-CNN和SPPnet。所有方法从相同的预训练VGG16网络开始,并使用bounding-box回归。VGG16 SPPnet的结果由论文[11]的作者提供。SPPnet在训练和测试期间使用五个尺度。Fast R-CNN相对于SPPnet的改进说明,即使Fast R-CNN使用单尺度训练和测试,对卷积层进行fine-tune也能使mAP得到很大提升(从63.1%到66.9%)。R-CNN的mAP为66.0%。另外值得一提的是,SPPnet的训练没有使用PASCAL中被标记为"困难"的样本。去掉这些样本后,Fast R-CNN的mAP提高到68.1%。所有其他实验都使用了被标记为"困难"的样本。

4.4. Training and testing time

Fast training and testing times are our second main result. Table 4 compares training time (hours), testing rate (seconds per image), and mAP on VOC07 between Fast RCNN, R-CNN, and SPPnet. For VGG16, Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it. Training time is reduced by 9×, from 84 hours to 9.5. Compared to SPPnet, Fast RCNN trains VGG16 2.7× faster (in 9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it. Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.

Table 4. Runtime comparison between the same models in Fast R-CNN, R-CNN, and SPPnet. Fast R-CNN uses single-scale mode. SPPnet uses the five scales specified in [11]. †Timing provided by the authors of [11]. Times were measured on an Nvidia K40 GPU.

4.4. 训练和测试时间

快速的训练和测试是我们的第二个主要成果。表4比较了Fast R-CNN、R-CNN和SPPnet之间的训练时间(单位小时)、测试速率(每张图像的秒数)和VOC07上的mAP。对于VGG16,不使用截断SVD时Fast R-CNN处理图像比R-CNN快146倍,使用截断SVD时快213倍。训练时间减少了9倍,从84小时减少到9.5小时。与SPPnet相比,Fast R-CNN训练VGG16网络快2.7倍(9.5小时对比25.5小时),不使用截断SVD时测试快7倍,使用截断SVD时快10倍。Fast R-CNN还省去了数百GB的磁盘存储,因为它不缓存特征。

表4. Fast R-CNN、R-CNN和SPPnet中相同模型之间的运行时间比较。Fast R-CNN使用单尺度模式。SPPnet使用[11]中指定的五个尺度。†时间由[11]的作者提供。所有时间均在Nvidia K40 GPU上测得。

Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088×4096 matrix in VGG16’s fc6 layer and the top 256 singular values from the 4096×4096 fc7 layer reduces runtime with little loss in mAP. Further speed-ups are possible with smaller drops in mAP if one fine-tunes again after compression.

截断的SVD。截断的SVD可以将检测时间减少30%以上,同时mAP只有很小的下降(0.3个百分点),并且无需在模型压缩后执行额外的fine-tune。图2显示了如何使用VGG16的fc6层中25088×4096矩阵的前1024个奇异值和fc7层中4096×4096矩阵的前256个奇异值来减少运行时间,而mAP几乎没有损失。如果在压缩之后再次fine-tune,则可以在mAP下降更小的情况下进一步提升速度。

4.5. Which layers to fine-tune?

For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4% (Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is important for very deep nets.

Table 5. Effect of restricting which layers are fine-tuned for VGG16. Fine-tuning ≥ fc6 emulates the SPPnet training algorithm [11], but using a single scale. SPPnet L results were obtained using five scales, at a significant (7×) speed cost.

4.5. fine-tune哪些层?

对于SPPnet论文[11]中提到的不太深的网络,仅fine-tuning全连接层似乎足以获得良好的准确度。我们假设这个结果不适用于非常深的网络。为了验证fine-tune卷积层对于VGG16的重要性,我们使用Fast R-CNN进行fine-tune,但冻结十三个卷积层,以便只有全连接层学习。这种消融模拟了单尺度SPPnet训练,将mAP从66.9%降低到61.4%(如表5所示)。这个实验验证了我们的假设:通过RoI池化层的训练对于非常深的网是重要的。

表5. 限制VGG16中哪些层进行fine-tune所产生的影响。fine-tune fc6及以上的层模拟了SPPnet训练算法[11],但使用单一尺度。SPPnet L的结果是使用五个尺度获得的,代价是显著(7倍)的速度开销。

Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.

这是否意味着所有卷积层都应该进行fine-tune?简而言之,不是的。在较小的网络(S和M)中,我们发现conv1(译者注:第一个卷积层)是通用的、不依赖于特定任务的(一个众所周知的事实[14])。允许或不允许conv1学习,对mAP没有实质影响。对于VGG16,我们发现只需要更新conv3_1及以上的层(13个卷积层中的9个)。这个观察结果是出于实用的考虑:(1)与从conv3_1开始更新相比,从conv2_1开始更新会使训练变慢1.3倍(12.5小时对比9.5小时);(2)从conv1_1开始更新会超出GPU内存。从conv2_1开始学习时mAP仅增加0.3个点(如表5最后一列所示)。本文中所有使用VGG16的Fast R-CNN结果都fine-tune conv3_1及以上的层;所有使用模型S和M的实验都fine-tune conv2及以上的层。

5. Design evaluation

We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to evaluate design decisions. Following best practices, we performed these experiments on the PASCAL VOC07 dataset.

5. 设计评估

我们通过实验来了解Fast R-CNN与R-CNN和SPPnet的比较情况,并评估各项设计决策。按照最佳实践,我们在PASCAL VOC07数据集上进行了这些实验。

5.1. Does multi-task training help?

Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained tasks. But it also has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection accuracy in Fast R-CNN?

5.1. 多任务训练有用吗?

多任务训练是方便的,因为它避免管理顺序训练任务的pipeline。但它也有可能改善结果,因为任务通过共享的表示(ConvNet)[2]相互影响。多任务训练能提高Fast R-CNN中的目标检测精度吗?

To test this question, we train baseline networks that use only the classification loss, Lcls, in Eq. 1 (i.e., setting λ= 0). These baselines are printed for models S, M, and L in the first column of each group in Table 6. Note that these models do not have bounding-box regressors. Next (second column per group), we take networks that were trained with the multi-task loss (Eq. 1, λ=1), but we disable bounding-box regression at test time. This isolates the networks’ classification accuracy and allows an apples-to-apples comparison with the baseline networks.

Table 6. Multi-task training (fourth column per group) improves mAP over piecewise training (third column per group).

为了测试这个问题,我们训练仅使用公式(1)中分类损失L_cls(即设置λ=0)的基准网络。表6中每组的第一列给出了模型S、M和L的这些基准结果。请注意,这些模型没有bounding-box回归器。接下来(每组的第二列),我们采用用多任务损失(公式(1),λ=1)训练的网络,但在测试时禁用bounding-box回归。这隔离了网络的分类准确性,并允许与基准网络进行同类(apples-to-apples)比较。

表6. 多任务训练(每组第四列)改进了分段训练(每组第三列)的mAP。

Across all three networks we observe that multi-task training improves pure classification accuracy relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points, showing a consistent positive effect from multi-task learning.

在所有三个网络中,我们观察到多任务训练相对于单独的分类训练提高了纯分类准确度。改进范围从+0.8到+1.1个mAP点,显示了多任务学习的一致的积极效果。

Finally, we take the baseline models (trained with only the classification loss), tack on the bounding-box regression layer, and train them with L_loc while keeping all other network parameters frozen. The third column in each group shows the results of this stage-wise training scheme: mAP improves over column one, but stage-wise training underperforms multi-task training (fourth column per group).

最后,我们采用baseline模型(仅使用分类损失进行训练),加上bounding-box回归层,并使用Lloc训练它们,同时保持所有其他网络参数冻结。每组中的第三列显示了这种逐级训练方案的结果:mAP相对于第一列有改进,但逐级训练表现不如多任务训练(每组第四列)。

5.2. Scale invariance: to brute force or finesse?

We compare two strategies for achieving scale-invariant object detection: brute-force learning (single scale) and image pyramids (multi-scale). In either case, we define the scale s of an image to be the length of its shortest side.

5.2. 尺度不变性:暴力或精细?

我们比较两个策略实现尺度不变物体检测:暴力学习(单尺度)和图像金字塔(多尺度)。在任一情况下,我们将尺度s定义为图像短边的长度。

All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.

所有单尺度实验使用s=600像素;对于一些图像,s可能小于600,因为我们限制图像最长边为1000像素并保持图像的宽高比。选择这些值是为了使VGG16在fine-tune期间不超出GPU内存。较小的模型不受内存限制,可以从更大的s值中受益;然而,为每个模型优化s并不是我们主要关注的问题。我们注意到PASCAL图像的平均大小是384×473像素,因此单尺度设置通常将图像上采样约1.6倍。因此,RoI池化层的平均有效步长约为10像素。
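
(译者注:单尺度设置下的缩放因子计算可以用下面的小函数示意:短边缩放到600像素、长边不超过1000像素并保持宽高比;函数名与接口为译者假设。)

def compute_image_scale(height, width, target_size=600, max_size=1000):
    # 短边缩放到600像素,同时限制长边不超过1000像素,并保持宽高比
    short_side, long_side = min(height, width), max(height, width)
    scale = target_size / short_side
    if long_side * scale > max_size:     # 长边超限时改以长边为准
        scale = max_size / long_side
    return scale

# PASCAL图像平均约为384×473像素,此时缩放因子约为600/384 ≈ 1.6(即上采样约1.6倍)
print(round(compute_image_scale(384, 473), 2))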

In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480, 576, 688, 864, 1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid exceeding GPU memory.

在多尺度模型配置中,我们使用[11]中指定的相同的五个尺度(s∈{480,576,688,864,1200}),以方便与SPPnet进行比较。但是,我们限制长边最大为2000像素,以避免GPU内存不足。

Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most surprising result in [11] was that single-scale detection performs almost as well as multi-scale detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0% reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is warped to a canonical size.

Table 7. Multi-scale vs. single scale. SPPnet ZF (similar to model S) results are from [11]. Larger networks with a single-scale offer the best speed / accuracy tradeoff. (L cannot use multi-scale in our implementation due to GPU memory constraints.)

表7显示了使用一个或五个尺度进行训练和测试时模型S和M的结果。也许[11]中最令人惊讶的结果是单尺度检测几乎与多尺度检测一样好。我们的研究结果证实了他们的结论:深度卷积网络擅长直接学习尺度不变性。多尺度方法以大量的计算时间为代价仅带来了很小的mAP提升(表7)。对于VGG16(模型L),由于实现细节的限制,我们只能使用单一尺度。然而它得到了66.9%的mAP,略高于R-CNN[10]报告的66.0%,尽管R-CNN在每个候选区域都被缩放到规范大小这个意义上使用了"无限"的尺度。

表7. 多尺度对比单尺度。SPPnet ZF(类似于模型S)的结果来自[11]。使用单尺度的较大网络提供了最佳的速度/精度平衡。(由于GPU内存限制,L在我们的实现中不能使用多尺度。)

Since single-scale processing offers the best tradeoff between speed and accuracy, especially for very deep models, all experiments outside of this sub-section use single-scale training and testing with s = 600 pixels.

由于单尺度处理能够权衡好速度和精度之间的关系,特别是对于非常深的模型,本小节以外的所有实验使用单尺度s=600像素的尺度进行训练和测试。

5.3. Do we need more training data?

A good object detector should improve when supplied with more training data. Zhu et al. [24] found that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k, to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0% (Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.

Table 1. VOC 2007 test detection average precision (%). All methods use VGG16. Training set key: 07: VOC07 trainval, 07\diff: 07 without "difficult" examples, 07+12: union of 07 and VOC12 trainval. †SPPnet results were prepared by the authors of [11].

5.3. 我们需要更多训练数据吗?

当提供更多的训练数据时,好的目标检测器的性能应该会进一步提升。Zhu等人[24]发现DPM[8]的mAP在只有几百到几千个训练样本时就饱和了。这里我们用VOC12 trainval训练集来扩充VOC07 trainval训练集,使图像数量大约增加到三倍,达到16.5k,以此评估Fast R-CNN。扩大训练集使VOC07测试集上的mAP从66.9%提高到70.0%(表1)。在这个数据集上训练时,我们使用60k次小批量迭代而不是40k次。

表1. VOC 2007测试检测平均精度(%)。所有方法都使用VGG16。训练集关键字:07代表VOC07 trainval,07\diff代表去掉"困难"样本的07,07+12代表07和VOC12 trainval的并集。†SPPnet结果由[11]的作者提供。

We perform similar experiments for VOC10 and 2012, for which we construct a dataset of 21.5k images from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use 100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k). For VOC10 and 2012, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.

我们对VOC10和2012进行了类似的实验,为此我们用VOC07 trainval、VOC07 test和VOC12 trainval的并集构造了一个包含21.5k个图像的数据集。在这个数据集上训练时,我们使用100k次SGD迭代,并且每40k次迭代(而不是每30k次)将学习率乘以0.1。对于VOC10和2012,mAP分别从66.1%提高到68.8%和从65.7%提高到68.4%。

5.4. Do SVMs outperform softmax?

Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest linear SVMs post-hoc, as was done in R-CNN and SPPnet. To understand the impact of this choice, we implemented post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and hyper-parameters as in R-CNN.

5.4. SVM分类是否优于Softmax?

Fast R-CNN使用在fine-tune期间学习到的softmax分类器,而不是像R-CNN和SPPnet那样事后训练一对多(one-vs-rest)线性SVM。为了理解这种选择的影响,我们在Fast R-CNN中实现了带有难负例挖掘(hard negative mining)的事后SVM训练。我们使用与R-CNN中相同的训练算法和超参数。

Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points. This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI.

Table 8. Fast R-CNN with softmax vs. SVM (VOC07 mAP).

如表8所示,对于所有三个网络,Softmax略优于SVM,mAP提高了+0.1到+0.8个点。这个提升效果很小,但是它表明与先前的多阶段训练方法相比,"一次性"fine-tune是足够的。我们注意到,与一对多的SVM不同,Softmax在计算RoI得分时会引入类别之间的竞争。

表8. 用Softmax的Fast R-CNN对比用SVM的Fast RCNN(VOC07 mAP)。

5.5. Are more proposals always better?

There are (broadly) two types of object detectors: those that use a sparse set of object proposals (e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of candidates leaving the classifier with a small set to evaluate. This cascade improves detection accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade also improves Fast R-CNN accuracy.

5.5. 更多的候选区域更好吗?

(广义上)存在两种类型的目标检测器:一类使用稀疏的候选区域集合(例如,selective search[21]),另一类使用密集集合(例如DPM[8])。对稀疏候选区域进行分类是一种级联方式[22]:候选机制首先舍弃大量候选区域,只留下较小的集合交给分类器评估。当应用于DPM检测时,这种级联方式提高了检测精度[21]。我们发现有证据表明,这种proposal-classifier级联方式也提高了Fast R-CNN的精度。

Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time re-training and re-testing model M. If proposals serve a purely computational role, increasing the number of proposals per image should not harm mAP.

使用selective search的质量模式,我们对每张图像从1k到10k个候选框进行扫描,每次都重新训练和重新测试模型M。如果候选框只是起到纯粹的计算作用,那么增加每张图像的候选框数量应该不会损害mAP。

We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3, solid blue line). This experiment shows that swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.

我们发现随着候选区域数量的增加,mAP先上升然后略微下降(如图3蓝色实线所示)。这个实验表明,用更多的候选区域来"淹没"深度分类器对精度没有帮助,甚至会轻微损害精度。

This result is difficult to predict without actually running the experiment. The state-of-the-art for measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows that AR (solid red line) does not correlate well with mAP as the number of proposals per image is varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.

Figure 3. VOC07 test mAP and AR for various proposal schemes.

如果不实际进行实验,这个结果很难预测。评估目标候选区域质量的最先进指标是平均召回率(Average Recall,AR)[12]。当每个图像使用固定数量的候选区域时,对于若干种使用R-CNN的候选区域方法,AR与mAP具有良好的相关性。图3表明,当每个图像的候选区域数量变化时,AR(红色实线)与mAP的相关性并不好。AR必须谨慎使用;更多的候选区域带来更高的AR并不意味着mAP也会增加。幸运的是,使用模型M的训练和测试不到2.5小时。因此,Fast R-CNN能够高效地、直接地评估目标候选区域对mAP的影响,这比使用代理指标更可取。

图3. 各种候选区域方案下VOC07测试的mAP和AR。

We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3, blue triangle).

我们还研究了使用密集生成框(在不同尺度、位置和宽高比上)时的Fast R-CNN,密度约为每张图像45k个框。这个密集集合足够丰富,当每个selective search框被与其最接近(按IoU计)的密集框替换时,mAP只下降1个点(降到57.7%,如图3蓝色三角形所示)。

The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k selective search boxes, we test mAP when adding a random sample of 1000×{2,4,6,8,10,32,45} dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added, mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.

密集框的统计特性与selective search框的不同。从2k个selective search框开始,我们每次再添加1000×{2,4,6,8,10,32,45}个随机采样的密集框,并测试mAP。对于每个实验,我们都重新训练和重新测试模型M。添加这些密集框时,mAP比添加更多selective search框时下降得更厉害,最终降到53.0%。

We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of 52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the dense box distribution. SVMs do even worse: 49.3% (blue circle).

我们还只使用密集框(每张图像45k个)来训练和测试Fast R-CNN。此设置的mAP为52.9%(蓝色菱形)。最后,我们检验是否需要带有难负例挖掘的SVM来应对密集框的分布。SVM的结果更糟糕:49.3%(蓝色圆圈)。

5.6. Preliminary MS COCO results

We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline. We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also averages over IoU thresholds, is 19.7%.

5.6. MS COCO初步结果

我们将Fast R-CNN(使用VGG16)应用于MS COCO数据集[18],以建立一个初步的baseline。我们在80k图像训练集上进行了240k次迭代训练,并使用评估服务器在"test-dev"数据集上进行评估。PASCAL标准的mAP为35.9%;新的COCO标准的AP(在多个IoU阈值上取平均)为19.7%。

6. Conclusion

This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting state-of-the-art detection results, we present detailed experiments that we hope provide new insights. Of particular note, sparse object proposals appear to improve detector quality. This issue was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course, there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse proposals. Such methods, if developed, may help further accelerate object detection.

6. 结论

本文提出了Fast R-CNN,它是对R-CNN和SPPnet的一个简洁而快速的更新。除了报告目前最先进的检测结果之外,我们还提供了详细的实验,希望能提供新的见解。特别值得注意的是,稀疏目标候选区域似乎提高了检测器的质量。过去这个问题探索起来代价太大(在时间上),但Fast R-CNN使其变得可行。当然,可能存在尚未发现的技术,使得密集框能够达到与稀疏候选框相当的效果。如果这样的方法被开发出来,则可以帮助进一步加速目标检测。

Acknowledgements. I thank Kaiming He, Larry Zitnick, and Piotr Dollár for helpful discussions and encouragement.

致谢:感谢Kaiming He,Larry Zitnick和Piotr Dollár的有益的讨论和鼓励。

References

参考文献

[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012. 5

[2] R. Caruana. Multitask learning. Machine learning, 28(1), 1997. 6

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014. 5

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 2

[5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014. 4

[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014. 3

[7] M. Everingham, L. Van Gool, C. K. I.Williams, J.Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 1

[8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010. 3, 7, 8

[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 3, 4, 8

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Regionbased convolutional networks for accurate object detection and segmentation. TPAMI, 2015. 5, 7, 8

[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1, 2, 3, 4, 5, 6, 7

[12] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv preprint arXiv:1502.05082, 2015. 8

[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. of the ACM International Conf. on Multimedia, 2014. 2

[14] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1, 4, 6

[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 1

[16] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comp., 1989. 1

[17] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014. 5

[18] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv e-prints, arXiv:1405.0312 [cs.CV], 2014. 8

[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014. 1, 3

[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 5

[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013. 8

[22] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. 8

[23] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013. 4

[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012. 7

[25] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015. 1, 5
