Faster R-CNN translation and notes

Summary:

1. The RPN is trained as a binary classification network; the RoIs it outputs serve directly as input to Fast R-CNN, which, after RoI pooling and training, performs object detection;

2. Since the RPN follows the sliding-window idea, it is implemented with convolutions: a convolution layer followed by two parallel 1×1 convolutions that predict the RoI locations and whether an RoI contains an object;

3. Question: each position of the conv feature map predicts k anchors, so what are the dimensions of the final output? And why, when sampling, does the text talk about sampling 128 RoIs?

Answer:

(1) The anchors come into play at the second column of the figure. By that point the original image has already passed through a series of convolution, pooling, and ReLU layers, producing the feature map here: 51×39×256 (256 is the number of channels).

On top of this feature map, a 3×3 sliding window moves over the 51×39 area with stride = 1 and padding = 1, so sliding produces 51×39 windows of size 3×3.

For each 3×3 window, the author computes the point in the original image that corresponds to the center of that sliding window. The author then assumes that this 3×3 window was obtained from the original image by SPP-style pooling, and that the areas and aspect ratios of the pooled regions are given by the anchors. In other words, each 3×3 window is assumed to come from the pooling of 9 different regions of the original image, all of which share exactly the same center point in the original image, namely the point corresponding to the center of the 3×3 window. In this way, at every window position we can use the 9 anchors of different aspect ratios and areas to map back to a region of the original image whose size and coordinates are all known, and that region is exactly the proposal we want. So, via the sliding window and the anchors, we obtain 51×39×9 proposals on the original image. For each proposal we then output only 6 parameters: the foreground and background probabilities obtained by comparing the proposal with the ground truth (2 parameters, corresponding to cls_score in the figure), and, since the proposal and the ground truth differ in position and size, the 4 translation and scaling parameters needed to transform the proposal into the ground truth (corresponding to bbox_pred in the figure).

So, according to this calculation, how many anchor boxes do we get in total?

51 × 39 × 9 = 17,901

which is roughly 20k.
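A quick check of this count, and of the mapping from feature-map positions back to image centers, as a minimal sketch; the total stride of 16 pixels is the value the paper reports for ZF/VGG in Section 3.3 and is an assumption here:

```python
import numpy as np

# Feature-map size (51x39) and k = 9 anchors per position come from the text above.
feat_w, feat_h, k, stride = 51, 39, 9, 16

# Each feature-map cell (x, y) maps back to a center point in the original image.
xs, ys = np.meshgrid(np.arange(feat_w), np.arange(feat_h))
centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)  # (39, 51, 2)

total_anchors = feat_w * feat_h * k
print(centers.shape, total_anchors)   # (39, 51, 2) 17901  -- roughly 20k
```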

(2)

  1. An image of arbitrary size is fed into the CNN backbone (ZF or VGG-16);

  2. The forward pass runs through the CNN up to the last shared convolutional layer; on one branch this yields the feature map fed to the RPN, while on the other branch propagation continues through the network-specific convolutional layers to produce higher-level feature maps;

  3. The feature map fed to the RPN passes through the RPN to produce region proposals and region scores; non-maximum suppression (IoU threshold 0.7) is applied to the scored regions, and the top-N (300 in the paper) proposals are handed to the RoI pooling layer;

  4. The high-level feature map from step 2 and the region proposals from step 3 both enter the RoI pooling layer, which extracts the features of each proposed region;

  5. The per-proposal features from step 4 pass through fully connected layers, which output the classification scores and the regressed bounding box for each region (a minimal sketch of this pipeline follows the list).
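A rough sketch of steps 1-5 as one forward pass. The callables (backbone, rpn_head, roi_pool, detector_head, nms) are placeholders for the components above, not names from the released code; the NMS threshold (0.7) and top-N (300) follow the text:

```python
# Minimal sketch of the Faster R-CNN inference pipeline described in steps 1-5.
def detect(image, backbone, rpn_head, roi_pool, detector_head, nms, top_n=300):
    feat = backbone(image)                         # step 2: shared conv feature map
    proposals, scores = rpn_head(feat)             # step 3: region proposals + objectness scores
    keep = nms(proposals, scores, iou_thresh=0.7)[:top_n]
    rois = roi_pool(feat, proposals[keep])         # step 4: fixed-size per-proposal features
    return detector_head(rois)                     # step 5: class scores + regressed boxes
```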

 

《Abstract》
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. 
In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features 
with the detection network, thus enabling nearly cost-free region proposals.
(Note: the paper introduces the RPN, a network that shares full-image convolutional features with the detection network, so region proposals become nearly cost-free.)
An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.
(Note: the RPN is a fully convolutional network that, at every position, simultaneously predicts object bounds and objectness scores, i.e., scores for whether the position contains an object.)
The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.
(Note: the RPN is trained end-to-end to generate high-quality proposals, which Fast R-CNN then uses for detection.)
We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—
using the recently popular terminology of neural networks with “attention” mechanisms, 
the RPN component tells the unified network where to look. 
(Note: RPN and Fast R-CNN are further merged into a single network by sharing their convolutional features; in "attention" terms, the RPN tells the unified network where to look.)
For the very deep VGG-16 model [3],our detection system has a frame rate of 5fps (including all steps) on a GPU, 
while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets 
with only 300 proposals per image. 
In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.


《3 FASTER R-CNN》

 


Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep
fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2]
that uses the proposed regions. 
(Our object detection system, Faster R-CNN, consists of two modules: the first is a fully convolutional network that proposes regions; the second is the Fast R-CNN detector, which uses the regions produced by the first module.)

The entire system is a single, unified network for object detection (Figure 2).
Using the recently popular terminology of neural networks with ‘attention’ [31] mechanisms, the RPN
module tells the Fast R-CNN module where to look.
(Note: the whole system is a single, unified network; the RPN module tells Fast R-CNN where to look.)
In Section 3.1 we introduce the designs and properties of the network for region proposal.
In Section 3.2 we develop algorithms for training both modules with features shared.

3.1 Region Proposal Networks (RPN)
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular
object proposals, each with an objectness score.
(Note: the RPN takes an image of any size as input and outputs a set of rectangular proposals, each with an objectness score.)
We model this process with a fully convolutional network[7], which we describe in this section.
Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], 
we assume that both nets share a common set of convolutional layers. 
In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers and the Simonyan and Zisserman model [3] (VGG-16),
which has 13 shareable convolutional layers.
To generate region proposals, we slide a small network over the convolutional feature map output
by the last shared convolutional layer. This small network takes as input an n × n spatial window of
the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature
(256-d for ZF and 512-d for VGG, with ReLU [33] following).
This feature is fed into two sibling fully-connected layers—a box-regression layer (reg) and a 
box-classification layer (cls). 
(To generate proposals, a small network is slid over the feature map output by the last shared convolutional layer; this small network takes an n × n spatial window of that feature map as its input.)
(Each sliding window is mapped to a lower-dimensional feature, 256-d for ZF or 512-d for VGG. This feature is fed into two sibling layers, one for box regression (reg)
and one for box classification (cls).)

We use n = 3 in this paper, noting that the effective receptive field on the input image is large 
(171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated
at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window
fashion, the fully-connected layers are shared across all spatial locations. 
This architecture is naturally implemented with an n×n convolutional layer followed
by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).
(The RPN architecture: an n × n convolution followed by two sibling 1 × 1 convolutions. After the 3 × 3 convolution in Figure 3, each position of the conv feature map gives one 256-d vector (for ZF); the cls and reg branches then apply 1 × 1 convolutions with 2k and 4k output channels respectively, so every feature-map position outputs objectness scores and box offsets for its k anchor boxes.)
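A minimal sketch of this head, assuming a VGG-16 backbone (512-d intermediate feature) and k = 9; the class and variable names are mine, not from the released code:

```python
import torch
import torch.nn as nn

# Minimal sketch of the RPN head: a 3x3 conv (the sliding window) followed by
# two sibling 1x1 convs for objectness scores and box offsets.
class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)  # n x n sliding window
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)   # 2 scores per anchor (object / not object)
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)   # 4 box offsets per anchor

    def forward(self, feat):                  # feat: (N, C, H, W) shared conv feature map
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)       # (N, 2k, H, W), (N, 4k, H, W)
```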

3.1.1 Anchors
(At each sliding-window position we simultaneously predict multiple region proposals; the maximum number of proposals per position is denoted k, and the k proposals are parameterized relative to k reference boxes, the anchors.)
At each sliding-window location, we simultaneously predict multiple region proposals, where the number
of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding
the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not
object for each proposal4. The k proposals are parameterized relative to k reference boxes, which we call
anchors.
An anchor is centered at the sliding window in question, and is associated with a scale and aspect
ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding
position. For a convolutional feature map of a size W × H (typically ∼2,400), there are W Hk anchors in
total.
(Each anchor is centered at its sliding window and is associated with one scale and one aspect ratio. By default 3 scales and 3 aspect ratios are used, so each sliding position yields k = 9 anchors.)
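A minimal sketch of generating the 9 anchors at one position; the scales (areas 128², 256², 512²) and ratios (1:1, 1:2, 2:1) are the defaults given in Section 3.3, and treating the ratio as height/width is an assumption about the convention:

```python
import numpy as np

# Minimal sketch: 3 scales x 3 aspect ratios, all centered at one sliding-window position.
def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    boxes = []
    for s in scales:                       # anchor area = s * s
        for r in ratios:                   # r = height / width (assumed convention)
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)                 # (9, 4) boxes as (x1, y1, x2, y2)

print(anchors_at(400, 300).round(1))       # 9 anchors sharing one center
```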
Translation-Invariant Anchors
An important property of our approach is that it is translation invariant, both in terms of the anchors
and the functions that compute proposals relative to the anchors. 
If one translates an object in an image,the proposal should translate and the same function should be able
to predict the proposal in either location. 
This translation-invariant property is guaranteed by our method5. As a comparison, the MultiBox method [27] 
uses k-means to generate 800 anchors,which are not translation invariant. So MultiBox does not guarantee 
that the same proposal is generated if an object is translated.The translation-invariant property also 
reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer,
whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors. 
As a result, our output layer has 2.8 × 10^4 parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude
fewer than MultiBox's output layer that has 6.1 × 10^6 parameters (1536 × (4 + 1) × 800 for GoogLeNet [34] in MultiBox
[27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer
parameters than MultiBox.
We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC 
Multi-Scale Anchors as Regression References
Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios).
As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on
image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at
multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for
each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding
windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models
of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way
is used to address multiple scales, it can be thought of as a “pyramid of filters” (Figure 1(b)). The second
way is usually adopted jointly with the first way [8]. As a comparison, our anchor-based method is built
on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes
with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature
maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. 
(There are several ways to handle multiple scales. The first is the pyramid approach: resize the image to multiple scales and compute features at each scale. The second is multi-scale sliding windows, i.e., windows (filters) of different sizes slid over the feature map, also called a pyramid of filters. This paper's approach is a pyramid of anchors: a single-scale image and feature map, with a filter of a single size.)

We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed
on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors
is a key component for sharing features without extra cost for addressing scales.

3.1.2 Loss Function

(After the RPN extracts candidates, each is compared with the ground truth to decide whether it is a positive sample. Fast R-CNN samples 2 images × 64 RoIs each; 64/9 ≈ 7, so is that where the 7×7 conv size comes from? No, no — the RoIs are simply sampled at random (128 RoIs in the question above).)
For training RPNs, we assign a binary class label(of being an object or not) to each anchor. 
(Note: to train the RPN, each anchor is given a binary label, so the RPN is trained as a binary classification problem.)
We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU)
overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. 
Note that a single ground-truth box may assign positive labels to multiple anchors.
(Positive samples: (a) the anchor(s) with the highest IoU with a ground-truth box; (b) anchors with IoU > 0.7 with any ground-truth box.)
Usually the second condition is sufficient to determine the positive samples; but we still adopt the first
condition for the reason that in some rare cases the second condition may find no positive sample. 
We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes.
(Negative samples: IoU < 0.3 with all ground-truth boxes.)
Anchors that are neither positive nor negative do not contribute to the training objective.
(Anchors that are neither positive nor negative contribute nothing to the training objective.)
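A minimal sketch of this labeling rule, assuming an IoU matrix of shape (num_anchors, num_gt) computed elsewhere; the 0.7/0.3 thresholds come from the text, and -1 marks anchors that are ignored:

```python
import numpy as np

# Minimal sketch of anchor labeling: 1 = object, 0 = background, -1 = ignored.
def assign_labels(iou, pos_thresh=0.7, neg_thresh=0.3):
    labels = np.full(iou.shape[0], -1, dtype=np.int64)    # start as "ignored"
    max_iou_per_anchor = iou.max(axis=1)
    labels[max_iou_per_anchor < neg_thresh] = 0            # negatives: IoU < 0.3 with all GT
    labels[max_iou_per_anchor >= pos_thresh] = 1           # positives (ii): IoU > 0.7 with some GT
    # positives (i): for each ground-truth box, the anchor with the highest IoU is positive,
    # even if that IoU is below 0.7 (rarely the only source of positives).
    best_anchor_per_gt = iou.argmax(axis=0)
    labels[best_anchor_per_gt] = 1
    return labels
```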
With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN
[2]. Our loss function for an image is defined as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)        (1)
Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an
object. The ground-truth label p_i* is 1 if the anchor is positive, and is 0 if the anchor is negative.
t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i* is that of the
ground-truth box associated with a positive anchor.
The classification loss L_cls is log loss over two classes (object vs. not object).
For the regression loss, we use L_reg(t_i, t_i*) = R(t_i − t_i*) where R is the robust loss
function (smooth L1) defined in [2]. The term p_i* L_reg means the regression loss is activated only for positive
anchors (p_i* = 1) and is disabled otherwise (p_i* = 0).
The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.
The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ.
In our current implementation (as in the released code), the cls term in Eqn. (1) is normalized by the mini-batch
size (i.e., N_cls = 256) and the reg term is normalized by the number of anchor locations (i.e., N_reg ≈ 2,400).
By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show
by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note
that the normalization as above is not required and could be simplified.
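A minimal sketch of Eqn. (1), assuming the sampled anchors with labels in {0, 1} and their 4-d regression targets have already been gathered; N_cls = 256, N_reg ≈ 2,400, and λ = 10 follow the text:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the RPN multi-task loss in Eqn. (1).
# cls_logits: (M, 2) scores for the sampled anchors; labels: (M,) in {0, 1};
# reg_pred / reg_target: (M, 4) parameterized offsets (t_i, t_i*).
def rpn_loss(cls_logits, labels, reg_pred, reg_target, n_cls=256, n_reg=2400, lam=10.0):
    loss_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    pos = labels == 1                                    # p_i* gates the regression term
    loss_reg = F.smooth_l1_loss(reg_pred[pos], reg_target[pos], reduction="sum") / n_reg
    return loss_cls + lam * loss_reg
```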
For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:

t_x = (x − x_a) / w_a,    t_y = (y − y_a) / h_a,    t_w = log(w / w_a),    t_h = log(h / h_a),
t_x* = (x* − x_a) / w_a,  t_y* = (y* − y_a) / h_a,  t_w* = log(w* / w_a),  t_h* = log(h* / h_a),        (2)
where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, x_a, and
x* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can
be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
Nevertheless, our method achieves bounding-box regression by a different manner from previous RoI-based
(Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features
pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. 
In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. 
To account for varying sizes, a set of k bounding-box regressors are learned. 
Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share
weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed
size/scale, thanks to the design of anchors.
(The features used for regression have the same spatial size (3×3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned; each regressor is responsible for one scale and one
aspect ratio, and the k regressors do not share weights. Thanks to the anchor design, boxes of various sizes can therefore be predicted even though the features are of a fixed size.)
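A minimal sketch of this parameterization: encoding a box relative to an anchor, and decoding a predicted offset back into a box (boxes given as center x, center y, width, height):

```python
import numpy as np

# Minimal sketch of the coordinate parameterization in Eqn. (2).
def encode(box, anchor):
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya, wa * np.exp(tw), ha * np.exp(th)])
```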

3.1.3 Training RPNs
The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [35].
We follow the “image-centric” sampling strategy from [2] to train this network. 
Each mini-batch arises from a single image that contains many positive and negative example anchors. 
(Note: each mini-batch comes from a single image that contains many positive and negative example anchors.)
It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples
as they dominate.
(Note: optimizing the loss over all anchors is possible, but it would be biased towards negative samples because they dominate in number.)
Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch,
where the sampled positive and negative anchors have a ratio of up to 1:1. 
If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
(Note: 256 anchors are randomly sampled from one image to compute the loss, with positive and negative anchors at a ratio of up to 1:1; if there are fewer than 128 positives, the mini-batch is padded with negatives.)
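A minimal sketch of this sampling step, reusing the 1/0/-1 label convention from the labeling sketch above; anchors not sampled are set back to -1 so they are ignored by the loss:

```python
import numpy as np

# Minimal sketch of sampling 256 anchors per image, up to 128 positives,
# with the remainder padded by negatives.
def sample_minibatch(labels, batch_size=256, rng=np.random.default_rng()):
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    num_pos = min(len(pos), batch_size // 2)            # at most 128 positives
    num_neg = min(batch_size - num_pos, len(neg))       # pad the rest with negatives
    keep = np.concatenate([rng.permutation(pos)[:num_pos],
                           rng.permutation(neg)[:num_neg]])
    sampled = np.full_like(labels, -1)                  # everything else is ignored
    sampled[keep] = labels[keep]
    return sampled
```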
We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with
standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by 
pretraining a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the
ZF net, and conv3_1 and up for the VGG net to conserve memory [2].
(New layers are randomly initialized from a zero-mean Gaussian with standard deviation 0.01; the other (shared) layers are initialized from an ImageNet-pretrained model.
For ZF, all layers are fine-tuned; for VGG, only conv3_1 and up.)
 We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. 
 We use a momentum of 0.9 and a weight decay of 0.0005 [37].Our implementation uses Caffe [38].
 (Learning rate 0.001 for the first 60k mini-batches and 0.0001 for the next 20k; momentum 0.9, weight decay 0.0005.)
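 A minimal sketch of these solver settings in PyTorch (the paper's own implementation uses Caffe); the 1×1 conv here is only a stand-in for the layers actually being trained:

```python
import torch

# Minimal sketch of the SGD schedule above (PASCAL VOC setting).
model = torch.nn.Conv2d(512, 18, 1)   # placeholder for the RPN layers being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)
# lr = 0.001 for the first 60k mini-batches, then 0.0001 for the next 20k
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60000], gamma=0.1)
```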
 
 3.2 Sharing Features for RPN and Fast R-CNN
 
Thus far we have described how to train a network for region proposal generation, without considering
the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt
Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast
R-CNN with shared convolutional layers (Figure 2).
Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways.
We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, 
rather than learning two separate networks. 
(Note: when trained independently, the RPN and Fast R-CNN each modify the convolutional layers in their own way;
a mechanism is therefore needed to share the convolutional layers between the two networks rather than learning two separate networks.)
We discuss three ways for training networks with features shared:
(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN.
The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the
solution that is used in all experiments in this paper.
(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one
network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals 
which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. 
The backward propagation takes place as usual, where for the shared layers the backward propagated signals
from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But
this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses,
so is approximate. In our experiments, we have empirically found this solver produces close results, yet
reduces the training time by about 25-50% comparing with alternating training. This solver is included in
our released Python code.
(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are
also functions of the input. The RoI pooling layer[2] in Fast R-CNN accepts the convolutional features
and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should
also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint
training. In a non-approximate joint training solution,we need an RoI pooling layer that is differentiable
w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer
as developed in [15], which is beyond the scope of this paper.
4-Step Alternating Training.
In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. 
In the first step,we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained
model and fine-tuned end-to-end for the region proposal task. 
In the second step, we train a separate detection network by Fast R-CNN using the proposals 
generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. 
At this point the two networks do not share convolutional layers. 
In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional 
layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. 
Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. 
As such, both networks share the same convolutional layers and form a unified network.
A similar alternating training can be run for more iterations, but we have observed negligible improvements.
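A minimal pseudocode sketch of the four steps; the three callables stand in for "train the RPN", "run the RPN to produce proposals", and "train Fast R-CNN", and are assumptions, not functions from the released code:

```python
# Minimal sketch of the 4-step alternating training described above.
def alternating_training(train_rpn, generate_proposals, train_fast_rcnn, images):
    # Step 1: train RPN end-to-end, initialized from an ImageNet-pretrained model.
    rpn = train_rpn(init="imagenet", freeze_shared_conv=False)
    # Step 2: train a separate Fast R-CNN on step-1 proposals (no shared layers yet).
    detector = train_fast_rcnn(init="imagenet", proposals=generate_proposals(rpn, images))
    # Step 3: re-initialize RPN from the detector; freeze the shared conv layers and
    # fine-tune only the RPN-specific layers (the conv layers are now shared).
    rpn = train_rpn(init=detector, freeze_shared_conv=True)
    # Step 4: keep the shared conv layers fixed and fine-tune Fast R-CNN's own layers.
    detector = train_fast_rcnn(init=detector, proposals=generate_proposals(rpn, images),
                               freeze_shared_conv=True)
    return rpn, detector
```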

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2].
We re-scale the images such that their shorter side is s = 600 pixels [2]. 
(Note: both the region proposal and detection networks are trained and tested on single-scale images; images are re-scaled so that the shorter side is s = 600 pixels.)
Multi-scale feature extraction(using an image pyramid) may improve accuracy but does not exhibit 
a good speed-accuracy trade-off [2].
On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16
pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large
stride provides good results, though accuracy may be further improved with a smaller stride.
(Note: a smaller stride could further improve accuracy.)
For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2,
and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation
experiments on their effects in the next section.
(Note: these hyper-parameters were not carefully tuned for a particular dataset.)
 As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales,
saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales
and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net.
We note that our algorithm allows predictions that are larger than the underlying receptive field.
(Note: the algorithm allows predictions that are larger than the underlying receptive field.)
Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle
of the object is visible.
(Note: even if only the middle of an object is visible, its extent can still be roughly inferred.)
(The following discusses the careful handling of anchors at the image boundary; during training the authors simply discard them.)
The anchor boxes that cross image boundaries need to be handled with care.
During training, we ignore all cross-boundary anchors so they do not contribute to the loss.
For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. 
With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the
boundary-crossing outliers are not ignored in training,they introduce large, difficult to correct error terms in
the objective, and training does not converge. During testing, however, we still apply the fully convolutional
RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image
boundary.
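A minimal sketch of this boundary handling: a keep-mask that drops cross-boundary anchors during training, and clipping for test-time proposals (boxes as x1, y1, x2, y2 in pixels):

```python
import numpy as np

# Minimal sketch: ignore cross-boundary anchors when training, clip proposals when testing.
def inside_image(anchors, width, height):
    return ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
            (anchors[:, 2] <= width) & (anchors[:, 3] <= height))   # keep-mask for training

def clip_to_image(boxes, width, height):
    boxes = boxes.copy()
    boxes[:, 0::2] = boxes[:, 0::2].clip(0, width)    # x1, x2
    boxes[:, 1::2] = boxes[:, 1::2].clip(0, height)   # y1, y2
    return boxes
```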
(For the heavily overlapping RPN proposals, non-maximum suppression is applied based on their cls scores.)
Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum
suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS
at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the
ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the
top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals,
but evaluate different numbers of proposals at test-time.
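A minimal sketch of score-based NMS as used on the proposals, with the IoU threshold of 0.7 from the text; boxes are (x1, y1, x2, y2):

```python
import numpy as np

# Minimal sketch of greedy NMS on scored proposal boxes.
def nms(boxes, scores, iou_thresh=0.7, top_n=2000):
    order = scores.argsort()[::-1]            # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        # IoU of the current best box with the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping the kept one too much
    return keep
```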
