Fast R-CNN

Key ideas:

1. How the network is built (image and RoI inputs, RoI pooling, two sibling outputs; Fig. 1)

2. How RoIs are sampled (following R-CNN)

3. RoI pooling (from SPPnet)

4. Multi-task loss (the localization loss is computed only for positive samples)

 

 

Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. 
Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. 
Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. 
(The main goal of Fast R-CNN is speed.)
Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. 
Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10×faster, and is more accurate. 
Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.


2. Fast R-CNN architecture and training


Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
(Fast R-CNN architecture: the input image and multiple RoIs are fed into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers.
For each RoI the network has two output vectors: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.)

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. (Input: one whole image plus a set of object proposals.)
The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. (The network first runs several conv and max pooling layers over the whole image to produce a conv feature map.)
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. (For each object proposal, the RoI pooling layer extracts a fixed-length feature vector from the feature map.)
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all "background" class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes. (Each feature vector goes through a stack of fc layers ending in two sibling output layers: one produces softmax probabilities over the K object classes plus a background class; the other outputs four real numbers per object class, each 4-tuple encoding a refined bounding box for that class.)
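As a concrete sketch of the two sibling outputs (PyTorch as a stand-in for the paper's Caffe implementation; K and the 4096-d fc7 feature width are the VGG16 values):

```python
import torch
import torch.nn as nn

K, feat_dim = 20, 4096                   # e.g., VOC classes; VGG16 fc7 width
cls_score = nn.Linear(feat_dim, K + 1)   # K object classes + background
bbox_pred = nn.Linear(feat_dim, 4 * K)   # 4 box offsets per object class

x = torch.randn(128, feat_dim)           # one fc7 feature vector per RoI
probs = cls_score(x).softmax(dim=1)      # (128, K+1) softmax probabilities
deltas = bbox_pred(x).view(-1, K, 4)     # (128, K, 4) per-class box offsets
```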

2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
(The RoI pooling layer uses max pooling to turn the features inside any valid RoI into a small feature map of fixed size H×W, e.g. 7×7; H and W are independent of any particular RoI. An RoI is a rectangular window on a conv feature map, defined by a four-tuple (r, c, h, w): top-left corner plus height and width.)
RoI max pooling works by dividing the h×w RoI window into an H × W grid of sub-windows of approximate size h/H ×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. 
(RoI max pooling divides the h×w RoI window into an H×W grid of sub-windows, each of approximate size h/H × w/W, and then max-pools the values in each sub-window into the corresponding output grid cell.)
Pooling is applied independently to each feature map channel, as in standard max pooling. 
(Pooling is applied to each feature map channel independently, as in standard max pooling.)
The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11]. 
(The RoI layer is the special case of SPPnet's spatial pyramid pooling with a single pyramid level.)
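A minimal NumPy sketch of the sub-window computation described above (the rounding details differ from the actual implementation in [11]):

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    """Max-pool one RoI window of a (C, height, width) conv feature map
    into a fixed (C, H, W) output: one sub-window per output grid cell."""
    r, c, h, w = roi                      # top-left corner (r, c), size (h, w)
    out = np.empty((feature_map.shape[0], H, W))
    for i in range(H):
        for j in range(W):
            # Sub-window boundaries, approximately h/H x w/W each.
            r0 = r + int(np.floor(i * h / H))
            r1 = r + int(np.ceil((i + 1) * h / H))
            c0 = c + int(np.floor(j * w / W))
            c1 = c + int(np.ceil((j + 1) * w / W))
            out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

For a 512-channel feature map, roi_max_pool(feats, (10, 14, 35, 21)) returns a (512, 7, 7) array regardless of the RoI's size.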

2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations. (Three pre-trained ImageNet networks are used, each with five max pooling layers and five to thirteen conv layers. Initializing Fast R-CNN from a pre-trained network takes the three transformations below.)
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16). 
(First, the last max pooling layer is replaced by an RoI pooling layer, with H and W set to be compatible with the network's first fully connected layer.)
Second, the network's last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories, and category-specific bounding-box regressors). 
(Second, the final fc layer and softmax are replaced with the two sibling layers.)
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images. 
(Third, the network is modified to take two inputs: a list of images and a list of RoIs in those images.)
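Putting the three transformations together, a hedged torchvision sketch (the paper's actual code is Caffe; layer indices follow torchvision's VGG16, and names here are illustrative):

```python
import torch.nn as nn
import torchvision

K = 20  # e.g., PASCAL VOC object classes

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")  # pre-trained on ImageNet

# 1) Replace the last max pooling layer with RoI pooling; H = W = 7 matches
#    VGG16's first fully connected layer, which expects 512*7*7 inputs.
backbone = nn.Sequential(*list(vgg.features)[:-1])       # drop final MaxPool2d
roi_pool = torchvision.ops.RoIPool(output_size=(7, 7), spatial_scale=1 / 16)

# 2) Replace the 1000-way ImageNet classifier with the two sibling layers.
fcs = nn.Sequential(nn.Flatten(), *list(vgg.classifier)[:-1])  # keep fc6/fc7
cls_score = nn.Linear(4096, K + 1)
bbox_pred = nn.Linear(4096, 4 * K)

# 3) The network now takes two inputs: images and their RoIs
#    (rows of (batch_index, x1, y1, x2, y2) for torchvision's RoIPool).
def fast_rcnn_forward(images, rois):
    feats = backbone(images)              # one conv pass shared by all RoIs
    vec = fcs(roi_pool(feats, rois))      # fixed-size vector per RoI
    return cls_score(vec).softmax(dim=1), bbox_pred(vec)
```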

2.3. Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN. 
(Being able to train all network weights with back-propagation is an important capability of Fast R-CNN.)
First, let's elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer. The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image). 

We propose a more efficient training method that takes advantage of feature sharing during training. 
In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

One concern over this strategy is that it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue, and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.
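A sketch of the hierarchical sampling scheme, assuming a dataset of (image, proposals) pairs (the data layout and names are illustrative):

```python
import random

def sample_hierarchical_minibatch(dataset, N=2, R=128):
    """Sample N images, then R/N RoIs per image, so every RoI in the
    mini-batch reuses its image's conv forward/backward computation."""
    images = random.sample(dataset, N)
    batch = []
    for image, proposals in images:
        rois = random.sample(proposals, min(R // N, len(proposals)))
        batch.append((image, rois))
    return batch
```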

Multi-task loss.
A Fast R-CNN network has two sibling output layers. 
The first outputs a discrete probability distribution (per RoI), p = (p_0, ..., p_K), over K + 1 categories. 
As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer. 
(First output: one discrete distribution p = (p_0, ..., p_K) per RoI, computed by softmax.)
The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. 
(Second output: bounding-box regression offsets; a regressed box is predicted for each of the K object classes.)
We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal. (Concretely, in [9] t_x and t_y shift the box center by fractions of the proposal's width and height, and t_w, t_h are log-ratios of the box sizes, making the targets scale-invariant.)
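The [9] parameterization written out as a small sketch (boxes are assumed given as centers plus width/height):

```python
import numpy as np

def regression_targets(proposal, gt):
    """t per [9]: scale-invariant translation of the center plus
    log-space width/height shift, relative to the object proposal."""
    px, py, pw, ph = proposal              # proposal center and size
    gx, gy, gw, gh = gt                    # ground-truth center and size
    return np.array([(gx - px) / pw,       # t_x
                     (gy - py) / ph,       # t_y
                     np.log(gw / pw),      # t_w
                     np.log(gh / ph)])     # t_h
```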
 Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. 
(Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v.)
We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression: 
(For each labeled RoI, a multi-task loss L jointly trains classification and regression.)

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v),    (1)
in which L_cls(p, u) = −log p_u is the log loss for the true class u. (Computed for the true class u.)
The second task loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. (The regression targets are likewise for the true class u.)
The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence L_loc is ignored. (For background RoIs, u = 0; there is no ground-truth box, so the regression loss is skipped.)
For bounding-box regression, we use the loss

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i),    (2)

in which

smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise,    (3)

is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. 
When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. 
(Smooth L1 is less sensitive to outliers than L2: for unbounded targets, the L2 gradient grows linearly with the error and can explode, while the smooth L1 gradient is capped at ±1 once |x| ≥ 1.)
Eq. 3 eliminates this sensitivity. The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. 
We normalize the ground-truth regression targets v_i to have zero mean and unit variance. (Normalization here means standardizing: subtract the mean and divide by the standard deviation of the targets, computed over the training set.)
 All experiments use λ = 1.
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).
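A NumPy sketch of Eqs. 1–3 for one labeled RoI (variable names are illustrative):

```python
import numpy as np

def smooth_l1(x):
    """Eq. 3: 0.5 x^2 where |x| < 1, |x| - 0.5 otherwise (elementwise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Eq. 1 for one RoI: log loss for the true class u plus, for
    non-background RoIs (u >= 1), the smooth L1 localization loss."""
    l_cls = -np.log(p[u])                              # L_cls(p, u)
    l_loc = smooth_l1(t_u - v).sum() if u >= 1 else 0  # [u >= 1] L_loc(t^u, v)
    return l_cls + lam * l_loc
```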
 
Mini-batch sampling.
During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. (Each mini-batch is built from 2 images, 64 RoIs each, 128 in total.)
As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. (25% of the RoIs come from proposals with IoU ≥ 0.5 against a ground-truth box; these are the positive examples, with u ≥ 1.)
The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0.
(The remaining RoIs come from proposals whose maximum IoU with ground truth lies in [0.1, 0.5); these are the background examples, with u = 0.)
 The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. 
During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
(Images are flipped horizontally with probability 0.5; no other augmentation. A sampling sketch follows below.)
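A sketch of the 25%/75% foreground/background split for one image, assuming each proposal's maximum IoU with the ground truth has been precomputed:

```python
import random

def sample_image_rois(proposals, max_ious, rois_per_image=64, fg_fraction=0.25):
    """Pick ~25% foreground RoIs (IoU >= 0.5, u >= 1) and fill the rest
    with background RoIs (max IoU in [0.1, 0.5), u = 0)."""
    fg = [i for i, iou in enumerate(max_ious) if iou >= 0.5]
    bg = [i for i, iou in enumerate(max_ious) if 0.1 <= iou < 0.5]
    n_fg = min(round(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    keep = random.sample(fg, n_fg) + random.sample(bg, n_bg)
    return [proposals[i] for i in keep]
```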
 
