Faster_RCNN(RPN)

qq_41627642

Tags: deep learning, network, convolution

Article link: https://blog.csdn.net/qq_41627642/article/details/104987613

Faster_RCNN

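Faster RCNN can be divided into four main parts:

1. `Conv layers`

As a CNN-based object detection method, Faster RCNN first uses a set of basic conv+relu+pooling layers to extract feature maps from the image. These feature maps are shared by the subsequent RPN layer and the fully connected layers.

2. `Region Proposal Networks`

The RPN generates region proposals. It uses softmax (the cls branch) to decide whether each anchor is foreground or background, and then applies bounding box regression to refine the anchors into accurate proposals.

3. `Roi Pooling`

This layer takes the feature maps (from part 1) and the proposals (from part 2), combines them to extract proposal feature maps, and passes these to the following fully connected layers, which determine the object class.

4. `Classification`

The proposal feature maps are used to compute the class of each proposal, and bounding box regression is applied once more to obtain the final, precise position of the detection box.

In the network structure of faster_rcnn_test.pt for the VGG16 model, one can clearly see that an image of arbitrary size PxQ is first rescaled to a fixed size MxN and then fed into the network; the Conv layers contain 13 conv layers + 13 relu layers + 4 pooling layers.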

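1.1 Conv layers

The Conv layers consist of three kinds of layers: conv, pooling and relu. In total there are 13 conv layers, 13 relu layers and 4 pooling layers. One detail here is easy to overlook but very important:
All conv layers use kernel_size=3, pad=1.
Every convolution in the Faster RCNN Conv layers pads its input (pad=1, i.e. one ring of zeros), so an MxN input becomes (M+2)x(N+2), and the 3x3 convolution then outputs MxN again. Because of this setting, the conv layers do not change the spatial size between input and output.
All pooling layers use kernel_size=2, stride=2.
Every MxN matrix that passes through such a pooling layer becomes (M/2)x(N/2). In summary, within the Conv layers the conv and relu layers keep the size unchanged, and only the pooling layers halve the height and width. After the 4 pooling layers, an MxN matrix always becomes **(M/16)x(N/16)**! As a result, every position of the feature map produced by the Conv layers can be mapped back to the original image.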
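
The size bookkeeping above can be checked with a small sketch. The helper names (`conv_out`, `pool_out`, `vgg16_feature_size`) are hypothetical, and the layer layout (conv blocks of 2, 2, 3, 3, 3 with a pooling layer after each of the first four) is an assumption consistent with the 13 conv + 4 pooling layers described here. Pooling rounds up, as Caffe does, so sizes that are not multiples of 16 come out slightly larger than M/16.

```python
import math

def conv_out(size, kernel=3, pad=1, stride=1):
    """Output length of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output length of a pooling layer along one axis (rounds up, as in Caffe)."""
    return math.ceil((size - kernel) / stride) + 1

def vgg16_feature_size(size):
    """Trace one spatial dimension through 13 conv layers and 4 pooling layers."""
    convs_per_block = [2, 2, 3, 3, 3]  # assumed VGG16 layout, 13 conv layers in total
    for block, n_conv in enumerate(convs_per_block):
        for _ in range(n_conv):
            size = conv_out(size)      # 3x3, pad=1: size is unchanged
        if block < 4:                  # only the first 4 blocks are followed by pooling
            size = pool_out(size)      # 2x2, stride=2: size is halved
    return size
```

For example, `vgg16_feature_size(800)` gives 50, i.e. 800/16.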

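1.2 RPN network structure

1.2.1 Standard RPN

[Figure: structure of the RPN network]
The figure above shows the concrete structure of the RPN. The RPN has two branches: the upper one classifies the anchors with softmax into foreground and background (the objects to detect are foreground), and the lower one computes the bounding box regression offsets for the anchors in order to obtain accurate proposals. The final Proposal layer combines the foreground anchors with the bounding box regression offsets to produce proposals, and at the same time discards proposals that are too small or that cross the image boundary. Once the network reaches the Proposal Layer, it has in effect completed object localization.
For every position of the feature map a 256-d vector is computed, and 9 anchors (combinations of scale and aspect ratio) are considered. Each anchor is assigned a binary label (foreground or background). Positive label: the anchor that has the highest IoU overlap with a ground truth (GT) box, or any anchor whose IoU with some GT box is greater than 0.7. Negative label (background): anchors whose IoU with every GT box is below 0.3. Anchors that are neither positive nor negative do not contribute to the training objective. The classification output therefore has **2*9 = 18** dimensions.

Suppose every point of the conv5 feature map has k anchors (k=9 by default). Each anchor is classified as foreground or background, so the 256-d feature at each point is converted into cls=2k scores; each anchor also has 4 offsets corresponding to [x, y, w, h], so reg=4k coordinates.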
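
As a quick sanity check of the 2k and 4k counts, the sketch below tallies the RPN outputs for a feature map. The function name `rpn_head_sizes` is hypothetical, and the 13 x 13 map is just the conv5 size used as an example later in this post.

```python
def rpn_head_sizes(feat_h, feat_w, k=9):
    """Count RPN outputs for a feat_h x feat_w feature map with k anchors per position."""
    return {
        "cls_channels": 2 * k,              # fg/bg score for each anchor
        "reg_channels": 4 * k,              # 4 offsets for each anchor
        "num_anchors": feat_h * feat_w * k, # anchors over the whole map
    }

sizes = rpn_head_sizes(13, 13)
```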

`Bounding-box regression`

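Reference blog

`How to fine-tune`

For proposals that differ only slightly from the ground truth box, we use linear regression (learning translation and scaling coefficients).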
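
To make the translation and scaling coefficients concrete, here is a minimal sketch of the usual parameterization: offsets for the center are normalized by the anchor size, and the width and height use a log ratio. The function names `encode` and `decode` are hypothetical, and boxes are assumed to be given as (center x, center y, width, height).

```python
import numpy as np

def encode(anchor, gt):
    """Regression targets (dx, dy, dw, dh) that move `anchor` onto `gt`."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return np.array([(gx - ax) / aw,    # translation in x, relative to anchor width
                     (gy - ay) / ah,    # translation in y, relative to anchor height
                     np.log(gw / aw),   # log scale factor for the width
                     np.log(gh / ah)])  # log scale factor for the height

def decode(anchor, deltas):
    """Apply predicted (dx, dy, dw, dh) to an anchor and return the refined box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return np.array([ax + dx * aw, ay + dy * ah, aw * np.exp(dw), ah * np.exp(dh)])
```

Decoding the encoded targets returns the ground truth box, which is the property the regressor is trained towards.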

One more point: training on all anchors would be too many, so the training procedure randomly picks 128 positive anchors + 128 negative anchors from the suitable anchors.

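1.2.2 The RPN in Faster RCNN

1. Using feat_stride and im_info, map the anchors of the pre-proposals obtained by the standard RPN back onto the original image, check whether each pre-proposal extends far beyond the image boundary, and discard those that do.

2. Sort the pre-proposals by softmax score (a probability) in descending order, take the top 2000 (this count is configurable; it is pre_nms_topN in the code later in this post), apply NMS (non-maximum suppression) to them, sort the survivors again, and output 300 proposals.

3. NMS (non-maximum suppression)

Because anchors often overlap, proposals also end up overlapping on the same object. To deal with duplicate proposals we use a simple algorithm called non-maximum suppression (NMS). NMS takes the list of proposals sorted by score and iterates over it, discarding every proposal whose IoU with a higher-scoring proposal exceeds a predefined threshold. In short, suppression is an iterate, traverse and eliminate process:

1. Sort all candidate boxes by score and select the box with the highest score.

2. Go through the remaining boxes; if the overlap (IoU) between a box and the current highest-scoring box is greater than a threshold, delete it.

3. From the boxes not yet processed, select the one with the highest score and repeat the process.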
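
The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the compiled `nms` routine used by the code later in this post; it assumes boxes are (x1, y1, x2, y2) rows with inclusive pixel coordinates, hence the `+ 1` in the areas.

```python
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # step 1: sort by score, best first
    keep = []
    while order.size > 0:
        i = order[0]                        # current highest-scoring box
        keep.append(int(i))
        rest = order[1:]
        # step 2: IoU of the current box with all remaining boxes
        w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1)
        h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        # step 3: drop boxes that overlap too much, then repeat on what is left
        order = rest[iou <= iou_thresh]
    return keep
```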

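`Structure of the RPN in faster-rcnn`

[Figure: network structure diagram]
1. First, the input image is 224 * 224 * 3 (the 3 is the three channels, i.e. RGB).

2. The convolution kernel of the first layer has dimensions 7 * 7 * 3 * 96 (so keep in mind that convolution kernels are 4-dimensional; this is how they are implemented in Caffe's matrix computations).

3. conv1 therefore produces 110 * 110 * 96 (the 110 comes from (224-7+pad)/2 + 1; pad is the usual padding, i.e. extra pixels added around the image so that the division is exact; the division by 2 is the stride shown in the figure; this formula is explained and derived in the document recommended above).

4. Next comes a pooling step that gives pool1. The pooling kernel is 3 * 3, so the result is 55 * 55 * 96 ((110-3+pad)/2 + 1 = 55).

5. Then another convolution follows, with a kernel of dimensions 5 * 5 * 96 * 256, giving conv2: 26 * 26 * 256.

6. The rest is similar, so I won't go through it step by step. Note that in some places the division is not exact and the author added padding; the pad of each layer can be found in the Caffe prototxt file.

7. Finally, the author takes the output of conv5, i.e. 13 * 13 * 256, and feeds it to the RPN.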
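
The sizes quoted in the list can be reproduced with the formula from step 3. In the sketch below the pad values are assumptions, chosen as the total padding that makes each division come out as stated in the text; they are not read from the prototxt. The kernel sizes, strides and channel counts follow the list and the prototxt shown later.

```python
def out_size(size, kernel, stride, pad):
    """Output length along one axis: (size - kernel + pad) / stride + 1,
    with pad taken as the total padding, as in the formula quoted in the text."""
    return (size - kernel + pad) // stride + 1

# (name, kernel, stride, assumed total pad, output channels)
layers = [
    ("conv1", 7, 2, 1, 96),
    ("pool1", 3, 2, 1, 96),
    ("conv2", 5, 2, 1, 256),
    ("pool2", 3, 2, 1, 256),
    ("conv3", 3, 1, 2, 384),
    ("conv4", 3, 1, 2, 384),
    ("conv5", 3, 1, 2, 256),
]

def trace(size=224):
    """Return the (height, width, channels) of every layer output."""
    shapes = {}
    for name, kernel, stride, pad, channels in layers:
        size = out_size(size, kernel, stride, pad)
        shapes[name] = (size, size, channels)
    return shapes

shapes = trace()
```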
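[Figure: sliding window on the conv feature map]
1. As noted above, the conv feature map has dimensions 13 * 13 * 256.

2. The sliding window is 3 * 3, so how do we get the 256-d vector? This is simple: a 4-dimensional kernel of size 3 * 3 * 256 * 256 convolves every 3 * 3 sliding window into a 256-dimensional vector.

Note that the diagram drawn by the author covers only a single sliding window. In a real implementation there are many sliding windows, so the result is not a single 256-d vector but still a 3-dimensional array. Written as a for loop over sliding windows this is probably easier to see; expressed as matrix operations it is a little less direct.

3. Then k=9 (anchors), so the cls layer has k*2=18 output nodes. Placing a 1 * 1 * 256 * 18 kernel between the 256-d vector and the cls layer produces the cls layer. This 1 * 1 * 256 * 18 kernel is what people usually think of as a fully connected layer, so full connection is just a special case of convolution (when the kernel is as large as the image, the convolution is a fully connected layer).

4. The reg layer works the same way. It has 9 * 4 = 36 outputs, so the corresponding kernel is 1 * 1 * 256 * 36, which yields the reg layer's output.

5. The cls layer and the reg layer are each followed by their own loss function, which gives the loss value and, from its derivative, the data for backpropagation.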
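
The remark that a 1 * 1 convolution is a fully connected layer applied at every position can be checked numerically. The sketch below uses random weights and a random 256 x 13 x 13 feature map purely for illustration; the helper name `conv1x1` is hypothetical and bias terms are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 13, 13))  # output of the 3*3 RPN conv: (channels, H, W)
w_cls = rng.standard_normal((18, 256))     # a 1*1*256*18 kernel stored as an (out, in) matrix
w_reg = rng.standard_normal((36, 256))     # a 1*1*256*36 kernel stored as an (out, in) matrix

def conv1x1(x, w):
    """1*1 convolution: the same linear map applied at every spatial position."""
    return np.einsum("oc,chw->ohw", w, x)

cls_scores = conv1x1(feat, w_cls)  # shape (18, 13, 13)
bbox_pred = conv1x1(feat, w_reg)   # shape (36, 13, 13)

# At any single position the result equals a fully connected layer on the 256-d vector.
fc_at_position = w_cls @ feat[:, 4, 7]
```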

name: "ZF"  
layer {  
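  name: 'input-data' # this layer is the initial data input
  type: 'Python'
  top: 'data' # top is the output of a layer; this layer outputs three blobs: data, the ground truth boxes gt_boxes, and the related info im_info
  top: 'im_info' # all of these are stored as matrices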
  top: 'gt_boxes'  
  python_param {  
    module: 'roi_data_layer.layer'  
    layer: 'RoIDataLayer'  
    param_str: "'num_classes': 21"  
  }  
}  
  
#========= conv1-conv5 ============  
  
layer {  
    name: "conv1"  # name of the layer
    type: "Convolution"  # convolution operation
    bottom: "data" # input blob: data
    top: "conv1" # output blob conv1; conv1 is the name of this layer's output, stored in the corresponding matrix
    param { lr_mult: 1.0 }
    param { lr_mult: 2.0 }
    convolution_param {
        num_output: 96  # number of kernels
        kernel_size: 7  # kernel size
        pad: 3  # conv1 pads 3 pixels
        stride: 2  
    }  
}  
layer {  
    name: "relu1"  # activation function
    type: "ReLU"  
    bottom: "conv1"  
    top: "conv1"  
}  
layer {  
    name: "norm1"  
    type: "LRN"  
    bottom: "conv1"  
    top: "norm1" # normalization; informally, a division
    lrn_param {  
        local_size: 3  
        alpha: 0.00005  
        beta: 0.75  
        norm_region: WITHIN_CHANNEL  
    engine: CAFFE  
    }  
}  
layer {  
    name: "pool1"  
    type: "Pooling"  
    bottom: "norm1"  
    top: "pool1"  
    pooling_param {  
        kernel_size: 3  
        stride: 2  
        pad: 1 # padding is applied again during pooling
        pool: MAX  
    }  
}  
layer {  
    name: "conv2"  
    type: "Convolution"  
    bottom: "pool1"  
    top: "conv2"  
    param { lr_mult: 1.0 }  
    param { lr_mult: 2.0 }  
    convolution_param {  
        num_output: 256  
        kernel_size: 5  
        pad: 2  
        stride: 2  
    }  
}  
layer {  
    name: "relu2"  
    type: "ReLU"  
    bottom: "conv2"  
    top: "conv2"  
}  
layer {  
    name: "norm2"  
    type: "LRN"  
    bottom: "conv2"  
    top: "norm2"  
    lrn_param {  
        local_size: 3  
        alpha: 0.00005  
        beta: 0.75  
        norm_region: WITHIN_CHANNEL  
    engine: CAFFE  
    }  
}  
layer {  
    name: "pool2"  
    type: "Pooling"  
    bottom: "norm2"  
    top: "pool2"  
    pooling_param {  
        kernel_size: 3  
        stride: 2  
        pad: 1  
        pool: MAX  
    }  
}  
layer {  
    name: "conv3"  
    type: "Convolution"  
    bottom: "pool2"  
    top: "conv3"  
    param { lr_mult: 1.0 }  
    param { lr_mult: 2.0 }  
    convolution_param {  
        num_output: 384  
        kernel_size: 3  
        pad: 1  
        stride: 1  
    }  
}  
layer {  
    name: "relu3"  
    type: "ReLU"  
    bottom: "conv3"  
    top: "conv3"  
}  
layer {  
    name: "conv4"  
    type: "Convolution"  
    bottom: "conv3"  
    top: "conv4"  
    param { lr_mult: 1.0 }  
    param { lr_mult: 2.0 }  
    convolution_param {  
        num_output: 384  
        kernel_size: 3  
        pad: 1  
        stride: 1  
    }  
}  
layer {  
    name: "relu4"  
    type: "ReLU"  
    bottom: "conv4"  
    top: "conv4"  
}  
layer {  
    name: "conv5"  
    type: "Convolution"  
    bottom: "conv4"  
    top: "conv5"  
    param { lr_mult: 1.0 }  
    param { lr_mult: 2.0 }  
    convolution_param {  
        num_output: 256  
        kernel_size: 3  
        pad: 1  
        stride: 1  
    }  
}  
layer {  
    name: "relu5"  
    type: "ReLU"  
    bottom: "conv5"  
    top: "conv5"  
}  
  
#========= RPN ============  
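# Now the RPN part; everything above is the 5 shared conv layers
layer {
  name: "rpn_conv1"  # RPN convolution
  type: "Convolution"
  bottom: "conv5"
  top: "rpn_conv1"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 256
    kernel_size: 3 pad: 1 stride: 1 # each 3*3 sliding window is mapped to a 256-d vector by a 3*3*256*256 kernel; with pad 1 and stride 1 the full output keeps the spatial size of conv5, with 256 channels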
    weight_filler { type: "gaussian" std: 0.01 }  
    bias_filler { type: "constant" value: 0 }  
  }  
}  
layer {  
  name: "rpn_relu1"  
  type: "ReLU"  
  bottom: "rpn_conv1"  
  top: "rpn_conv1"  
}  
layer {  
  name: "rpn_cls_score"  
  type: "Convolution"  
  bottom: "rpn_conv1"  
  top: "rpn_cls_score"  
  param { lr_mult: 1.0 }  
  param { lr_mult: 2.0 }  
  convolution_param {  
    num_output: 18   # 2(bg/fg) * 9(anchors)  
    kernel_size: 1 pad: 0 stride: 1 # a 1*1*256*18 kernel turns the 256-d data into 18 outputs
    weight_filler { type: "gaussian" std: 0.01 }  
    bias_filler { type: "constant" value: 0 }  
  }  
}  
layer {  
  name: "rpn_bbox_pred"  
  type: "Convolution"  
  bottom: "rpn_conv1"  
  top: "rpn_bbox_pred"  
  param { lr_mult: 1.0 }  
  param { lr_mult: 2.0 }  
  convolution_param {  
    num_output: 36   # 4 * 9(anchors)  
    kernel_size: 1 pad: 0 stride: 1 # a 1*1*256*36 kernel turns the 256-d data into 36 outputs
    weight_filler { type: "gaussian" std: 0.01 }  
    bias_filler { type: "constant" value: 0 }  
  }  
}  
layer {  
   bottom: "rpn_cls_score"  
   top: "rpn_cls_score_reshape" # rpn_cls_score is N x 18 x H x W; it is reshaped to N x 2 x (9*H) x W so that the softmax loss normalizes over the two classes (bg/fg) of each anchor
   name: "rpn_cls_score_reshape"  
   type: "Reshape"  
   reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }  
}  
layer {  
  name: 'rpn-data'  
  type: 'Python'  
  bottom: 'rpn_cls_score'  
  bottom: 'gt_boxes'  
  bottom: 'im_info'  
  bottom: 'data'  
  top: 'rpn_labels'  
  top: 'rpn_bbox_targets'  
  top: 'rpn_bbox_inside_weights'  
  top: 'rpn_bbox_outside_weights'  
  python_param {  
    module: 'rpn.anchor_target_layer'  
    layer: 'AnchorTargetLayer'  
    param_str: "'feat_stride': 16"  
  }  
}  
layer {  
  name: "rpn_loss_cls"  
  type: "SoftmaxWithLoss" # softmax loss: takes the labels and the 18 outputs of the cls layer (reshaped in between) and outputs the loss value
  bottom: "rpn_cls_score_reshape"  
  bottom: "rpn_labels"  
  propagate_down: 1  
  propagate_down: 0  
  top: "rpn_cls_loss"  
  loss_weight: 1  
  loss_param {  
    ignore_label: -1  
    normalize: true  
  }  
}  
layer {  
  name: "rpn_loss_bbox"  
  type: "SmoothL1Loss" # the value of the box regression loss
  bottom: "rpn_bbox_pred"  
  bottom: "rpn_bbox_targets"  
  bottom: "rpn_bbox_inside_weights"  
  bottom: "rpn_bbox_outside_weights"  
  top: "rpn_loss_bbox"  
  loss_weight: 1  
  smooth_l1_loss_param { sigma: 3.0 }  
}  
  
#========= RCNN ============  
# Dummy layers so that initial parameters are saved into the output net  
  
layer {  
  name: "dummy_roi_pool_conv5"  
  type: "DummyData"  
  top: "dummy_roi_pool_conv5"  
  dummy_data_param {  
    shape { dim: 1 dim: 9216 }  
    data_filler { type: "gaussian" std: 0.01 }  
  }  
}  
layer {  
  name: "fc6"  
  type: "InnerProduct"  
  bottom: "dummy_roi_pool_conv5"  
  top: "fc6"  
  param { lr_mult: 0 decay_mult: 0 }  
  param { lr_mult: 0 decay_mult: 0 }  
  inner_product_param {  
    num_output: 4096  
  }  
}  
layer {  
  name: "relu6"  
  type: "ReLU"  
  bottom: "fc6"  
  top: "fc6"  
}  
layer {  
  name: "fc7"  
  type: "InnerProduct"  
  bottom: "fc6"  
  top: "fc7"  
  param { lr_mult: 0 decay_mult: 0 }  
  param { lr_mult: 0 decay_mult: 0 }  
  inner_product_param {  
    num_output: 4096  
  }  
}  
layer {  
  name: "silence_fc7"  
  type: "Silence"  
  bottom: "fc7"
}


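RPN code walkthrough

Reference blog

First, let's look at what the RPN looks like in Faster RCNN. The RPN is attached directly to the feature output of the classification network through one convolutional layer, rpn_conv/3x3, followed by two convolutional layers, rpn_cls_score and rpn_bbox_pred, which produce the foreground/background classification and the predicted boxes respectively. The Python layer AnchorTargetLayer then generates the classification labels and box targets of the anchor mechanism. After that, ROI Proposal produces the candidate ROI regions, and ROI Pooling normalizes them to the same size for subsequent processing. The overall structure is shown in the figure below:
[Figure: overall structure of the RPN in Faster RCNN]

RPN code

The figure above gives an intuitive but rough idea of the RPN; what exactly happens inside is still unclear. So let's go through the code. First, here is a description of each file in the RPN module.
(1) generate_anchors.py
Generates anchors with different aspect ratios and scales from the base anchor [0,0,15,15].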

# --------------------------------------------------------
# Faster R-CNN
import numpy as np

# Verify that we compute the same anchors as Shaoqing's matlab implementation:
#
#    >> load output/rpn_cachedir/faster_rcnn_VOC2007_ZF_stage1_rpn/anchors.mat
#    >> anchors
#
#    anchors =
#
#       -83   -39   100    56
#      -175   -87   192   104
#      -359  -183   376   200
#       -55   -55    72    72
#      -119  -119   136   136
#      -247  -247   264   264
#       -35   -79    52    96
#       -79  -167    96   184
#      -167  -343   184   360
'''
generate_anchors.py
Generates anchors with different aspect ratios and scales from the base anchor [0,0,15,15].
'''
def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
                     scales=2**np.arange(3, 6)):
    """
    Generate anchor (reference) windows by enumerating aspect ratios X
    scales wrt a reference (0, 0, 15, 15) window.
    (0, 0) is the top-left corner and (15, 15) the bottom-right corner.
    base_size is the size of base_anchor
    ratios are the aspect ratios (height / width) [0.5, 1, 2]
    scales are the scale factors [8, 16, 32]
    """
    base_anchor = np.array([1, 1, base_size, base_size]) - 1  # first create the base anchor
    ratio_anchors = _ratio_enum(base_anchor, ratios)
    anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
                         for i in range(ratio_anchors.shape[0])])  # for each aspect ratio, generate the anchors at every scale
    return anchors

def _whctrs(anchor):
    """
    Return width, height, x center, and y center for an anchor (window).
    """
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    x_ctr = anchor[0] + 0.5 * (w - 1)  # center of the anchor
    y_ctr = anchor[1] + 0.5 * (h - 1)
    return w, h, x_ctr, y_ctr

def _mkanchors(ws, hs, x_ctr, y_ctr):
    """
    Given a vector of widths (ws) and heights (hs) around a center
    (x_ctr, y_ctr), output a set of anchors (windows).
    """
    ws = ws[:, np.newaxis]
    hs = hs[:, np.newaxis]
    anchors = np.hstack((x_ctr - 0.5 * (ws - 1),
                         y_ctr - 0.5 * (hs - 1),
                         x_ctr + 0.5 * (ws - 1),
                         y_ctr + 0.5 * (hs - 1)))   # top-left and bottom-right coordinates
    return anchors

def _ratio_enum(anchor, ratios):
    """
    Enumerate a set of anchors for each aspect ratio wrt an anchor.
    """
    w, h, x_ctr, y_ctr = _whctrs(anchor)  # width, height and center of the base anchor
    size = w * h  # area of the base anchor, 256
    size_ratios = size / ratios  # area divided by each ratio: [512, 256, 128]
    ws = np.round(np.sqrt(size_ratios))  # take the square root, then round
    hs = np.round(ws * ratios)
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)  # anchors for the different aspect ratios
    return anchors

def _scale_enum(anchor, scales):
    """
    Enumerate a set of anchors for each scale wrt an anchor.
    """
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    ws = w * scales
    hs = h * scales
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)  # anchors for the different scales
    return anchors
if __name__ == '__main__':
    import time
    t = time.time()
    a = generate_anchors()
    print (time.time() - t)
    print (a)
    from IPython import embed; embed()
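
As a worked example of the ratio step, the sketch below repeats the arithmetic of `_ratio_enum` for the 16 x 16 base anchor and then builds the first anchor at scale 8. It only re-traces the formulas in the code above; the variable names are chosen for illustration.

```python
import numpy as np

base_size = 16
ratios = np.array([0.5, 1, 2])

size = base_size * base_size           # area of the base anchor: 256
size_ratios = size / ratios            # [512, 256, 128]
ws = np.round(np.sqrt(size_ratios))    # widths for the three aspect ratios
hs = np.round(ws * ratios)             # matching heights

# First anchor: ratio 0.5 at scale 8, centered on the base anchor's center (7.5, 7.5)
ctr = 0.5 * (base_size - 1)
w, h = ws[0] * 8, hs[0] * 8
first_anchor = [ctr - 0.5 * (w - 1), ctr - 0.5 * (h - 1),
                ctr + 0.5 * (w - 1), ctr + 0.5 * (h - 1)]
```

The widths come out as 23, 16, 11 and the heights as 12, 16, 22, so the three shapes keep roughly the same area while the height/width ratio changes from 0.5 to 2.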

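(2) proposal_layer.py
Converts the classification score and the box regression estimate of every RPN anchor into object proposals.
This layer has 3 inputs: the fg/bg anchor classifier result rpn_cls_prob_reshape, the corresponding bbox regression offsets [dx(A), dy(A), dw(A), dh(A)] in rpn_bbox_pred, and im_info; in addition there is the parameter feat_stride=16. First, im_info: an image of arbitrary size is resized to a fixed M*N before being passed into Faster RCNN, and im_info=[M,N,scale_factor] stores all the information about this rescaling. The image then goes through the Conv layers and, after 4 pooling steps, becomes (M/16)*(N/16); feat_stride=16 records this. All of these values exist so that proposals can be mapped back to the original image. Let's look at the layer's setup function first:

def setup(self, bottom, top):
    # parse the layer parameter string, which must be valid YAML
    layer_params = yaml.load(self.param_str_)
    self._feat_stride = layer_params['feat_stride']
    anchor_scales = layer_params.get('scales', (8, 16, 32))
    self._anchors = generate_anchors(scales=np.array(anchor_scales)) # generate the default 9 anchors
    self._num_anchors = self._anchors.shape[0]

    if DEBUG:
        print('feat_stride: {}'.format(self._feat_stride))
        print('anchors:')
        print(self._anchors)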

    # rois blob: holds R regions of interest, each is a 5-tuple
    # (n, x1, y1, x2, y2) specifying an image batch index n and a
    # rectangle (x1, y1, x2, y2)
    top[0].reshape(1, 5)

    # scores blob: holds scores for R regions of interest
    if len(top) > 1:
        top[1].reshape(1, 1, 1, 1)

Before the forward pass, some configuration values are loaded:

cfg_key = str(self.phase) # either 'TRAIN' or 'TEST'; the number of boxes going into and out of NMS differs between the two phases
# Number of top scoring boxes to keep before apply NMS to RPN proposals
pre_nms_topN  = cfg[cfg_key].RPN_PRE_NMS_TOP_N # 12000 (TRAIN default)
# Number of top scoring boxes to keep after applying NMS to RPN proposals
post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N # 2000 (TRAIN default)
## NMS threshold used on RPN proposals
nms_thresh    = cfg[cfg_key].RPN_NMS_THRESH # 0.7
# Proposal height and width both need to be greater than RPN_MIN_SIZE (at orig image scale)
min_size      = cfg[cfg_key].RPN_MIN_SIZE # 16

# the first set of _num_anchors channels are bg probs
# the second set are the fg probs, which we want
# the first 9 channels are the background class; the last 9 channels are the non-background class
scores = bottom[0].data[:, self._num_anchors:, :, :] # fg scores (second half of the 18 classification channels)
bbox_deltas = bottom[1].data # predicted box offsets
im_info = bottom[2].data[0, :] # image information

Now the proposal generation starts.
step1: generate the anchors again and use bbox_deltas to obtain the predicted boxes

# 1. Generate proposals from bbox deltas and shifted anchors
height, width = scores.shape[-2:]
if DEBUG:
    print('score map size: {}'.format(scores.shape))
# Enumerate all shifts; this part is the same as in anchor_target_layer
shift_x = np.arange(0, width) * self._feat_stride
shift_y = np.arange(0, height) * self._feat_stride
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()

# Enumerate all shifted anchors:
#
# add A anchors (1, A, 4) to
# cell K shifts (K, 1, 4) to get
# shift anchors (K, A, 4)
# reshape to (K*A, 4) shifted anchors
A = self._num_anchors
K = shifts.shape[0]
anchors = self._anchors.reshape((1, A, 4)) + \
                  shifts.reshape((1, K, 4)).transpose((1, 0, 2))
anchors = anchors.reshape((K * A, 4))
# Transpose and reshape predicted bbox transformations to get them
# into the same order as the anchors:
#
# bbox deltas will be (1, 4 * A, H, W) format
# transpose to (1, H, W, 4 * A)
# reshape to (1 * H * W * A, 4) where rows are ordered by (h, w, a)
# in slowest to fastest order
bbox_deltas = bbox_deltas.transpose((0, 2, 3, 1)).reshape((-1, 4))
# Same story for the scores:
#
# scores are (1, A, H, W) format
# transpose to (1, H, W, A)
# reshape to (1 * H * W * A, 1) where rows are ordered by (h, w, a)
scores = scores.transpose((0, 2, 3, 1)).reshape((-1, 1))
# Convert anchors into proposals via bbox transformations
# use bbox_deltas to correct the anchors and get the predicted positions of the proposals; see the formulas in the paper
# x and y use a linear transform, w and h use exp
proposals = bbox_transform_inv(anchors, bbox_deltas)

step2: clip the predicted boxes so that they lie inside the image

# 2. clip predicted boxes to image
# clip the predicted boxes to the image boundary
proposals = clip_boxes(proposals, im_info[:2])

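step3: remove small predicted boxes; the threshold is 16

# 3. remove predicted boxes with either height or width < threshold
# (NOTE: convert min_size to input image scale stored in im_info[2])
# remove predicted boxes whose height or width is below 16, since 4 pooling steps were applied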
keep = _filter_boxes(proposals, min_size * im_info[2])
proposals = proposals[keep, :]
scores = scores[keep]
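
Steps 2 and 3 rely on two helpers that are not shown in this post. The sketch below illustrates what such helpers typically do; it is an assumption about their behavior, not the repository's exact code. Boxes are (x1, y1, x2, y2) rows with inclusive pixel coordinates, and `filter_boxes` here is a stand-in for `_filter_boxes`.

```python
import numpy as np

def clip_boxes(boxes, im_shape):
    """Clip boxes to an image of shape (height, width)."""
    height, width = im_shape
    clipped = boxes.copy()
    clipped[:, 0::2] = np.clip(clipped[:, 0::2], 0, width - 1)   # x1, x2
    clipped[:, 1::2] = np.clip(clipped[:, 1::2], 0, height - 1)  # y1, y2
    return clipped

def filter_boxes(boxes, min_size):
    """Return the indices of boxes whose width and height are both >= min_size."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    return np.where((ws >= min_size) & (hs >= min_size))[0]
```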

step4: sort the predicted boxes by score and send the top N to NMS

# 4. sort all (proposal, score) pairs by score from highest to lowest
# 5. take top pre_nms_topN (e.g. 6000); select the top N here and run NMS afterwards, see the settings above
order = scores.ravel().argsort()[::-1]
if pre_nms_topN > 0:
    order = order[:pre_nms_topN]
proposals = proposals[order, :] # coordinates of the top pre_nms_topN boxes
scores = scores[order] # scores of the top pre_nms_topN boxes

step5: run NMS and keep the top N

# 6. apply nms (e.g. threshold = 0.7)
# 7. take after_nms_topN (e.g. 300)
# 8. return the top proposals (-> RoIs top); apply NMS to the predicted boxes
keep = nms(np.hstack((proposals, scores)), nms_thresh)
if post_nms_topN > 0:
    keep = keep[:post_nms_topN]
proposals = proposals[keep, :] # keep the top after_nms_topN boxes after NMS
scores = scores[keep]

step6: output the results

# Output rois blob
# Our RPN implementation only supports a single input image, so all
# batch inds are 0
batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))
top[0].reshape(*(blob.shape))
top[0].data[...] = blob
# [Optional] output scores blob
if len(top) > 1:
    top[1].reshape(*(scores.shape))
    top[1].data[...] = scores

(3) anchor_target_layer.py
Generates training targets or labels for every anchor. The classification label is simply 0 (not an object), 1 (object) or -1 (ignore). Box regression targets are assigned only when the label is greater than 0. This part uses anchor_target_layer.py and generate_anchors.py. First, the layer's setup function:

def setup(self, bottom, top):
    layer_params = yaml.load(self.param_str_)
    anchor_scales = layer_params.get('scales', (8, 16, 32)) # anchor scale parameters
    self._anchors = generate_anchors(scales=np.array(anchor_scales)) # generate the default 9 anchors
    self._num_anchors = self._anchors.shape[0] # number of anchors, 9
    self._feat_stride = layer_params['feat_stride'] # stride of the feature map relative to the input image after the CNN, 16

    # allow boxes to sit over the edge by a small amount
    # if 0, any anchor that crosses the image boundary, even slightly, is discarded
    self._allowed_border = layer_params.get('allowed_border', 0)

    height, width = bottom[0].data.shape[-2:]
    if DEBUG:
        print('AnchorTargetLayer: height {} width {}'.format(height, width))

    A = self._num_anchors
    # labels: object / non-object classification
    top[0].reshape(1, 1, A * height, width)
    # bbox_targets: dx, dy, dw, dh
    top[1].reshape(1, A * 4, height, width)
    # bbox_inside_weights p*
    top[2].reshape(1, A * 4, height, width)
    # bbox_outside_weights   Nreg=1/256
    top[3].reshape(1, A * 4, height, width)

Next comes the key forward function. It first generates, on the feature map, all anchors that take part in the computation (9 anchors per position).

# 1. Generate proposals from bbox deltas and shifted anchors
# shifts in the x direction; there are as many as the width of the feature map
shift_x = np.arange(0, width) * self._feat_stride
# shifts in the y direction; there are as many as the height of the feature map
shift_y = np.arange(0, height) * self._feat_stride
# after meshgrid, shift_x and shift_y are both 2-D arrays of shape (height, width); pairing the elements at the same position gives the shift to apply on the image
# (the shift is relative to the 9 anchors at the top-left corner of the image), so there are width*height shift pairs.
# Adding each shift pair to the 9 base anchors gives
# all anchors, width*height*9 in total, stored in all_anchors
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose() # output shape is ((width*height), 4)
# add A anchors (1, A, 4) to
# cell K shifts (K, 1, 4) to get
# shift anchors (K, A, 4)
# reshape to (K*A, 4) shifted anchors
A = self._num_anchors
K = shifts.shape[0] # K=width*height
# from the 9 base anchors, produce K*A anchors, which is the total number of anchors
all_anchors = (self._anchors.reshape((1, A, 4)) +
               shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
all_anchors = all_anchors.reshape((K * A, 4))
total_anchors = int(K * A) # total number of anchors
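
The broadcasting trick above is easier to follow on a tiny example. The sketch below wraps the same idea in a hypothetical function `shift_anchors` and uses two made-up base anchors on a 2 x 3 feature map; it is an illustration, not the layer's own code.

```python
import numpy as np

def shift_anchors(base_anchors, height, width, feat_stride=16):
    """Place every base anchor at every feature-map position; returns (K*A, 4)."""
    shift_x = np.arange(width) * feat_stride
    shift_y = np.arange(height) * feat_stride
    sx, sy = np.meshgrid(shift_x, shift_y)        # both of shape (height, width)
    shifts = np.stack([sx.ravel(), sy.ravel(),
                       sx.ravel(), sy.ravel()], axis=1)  # (K, 4), K = height * width
    # (1, A, 4) + (K, 1, 4) broadcasts to (K, A, 4)
    all_anchors = base_anchors[None, :, :] + shifts[:, None, :]
    return all_anchors.reshape(-1, 4)

base = np.array([[0, 0, 15, 15], [-8, -8, 23, 23]])  # two illustrative base anchors
all_anchors = shift_anchors(base, height=2, width=3)
```

Rows are ordered position by position (x varies fastest), with the A anchors of one position stored next to each other.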

With this many (K*A) anchors, some of them inevitably cross the image boundary and have to be removed:

# only keep anchors inside the image: these are the valid anchors; anchors that cross the boundary are dropped
# all_anchors[:, 0] is x1, all_anchors[:, 1] is y1, all_anchors[:, 2] is x2, all_anchors[:, 3] is y2
inds_inside = np.where(
    (all_anchors[:, 0] >= -self._allowed_border) &
    (all_anchors[:, 1] >= -self._allowed_border) &
    (all_anchors[:, 2] < im_info[1] + self._allowed_border) &  # width
    (all_anchors[:, 3] < im_info[0] + self._allowed_border)    # height
    )[0]
# keep only inside anchors
anchors = all_anchors[inds_inside, :]

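Initialize the labels of the usable anchors; the meaning of the label values is given in the comment below:

# label: 1 is positive, 0 is negative, -1 is dont care
# object / non-object label of each anchor inside the image; its length is the number of anchors that passed the check
labels = np.empty((len(inds_inside), ), dtype=np.float32)
labels.fill(-1)  # fill with -1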

With the required anchors generated, the next step is to relate the anchors to the GT boxes. This is measured by their overlap (IoU), and the object / non-object label of each anchor is set from this measure.

# overlaps between the anchors and the gt boxes
# overlaps (ex, gt) is a 2-D array of shape [anchors * gt_boxes]
overlaps = bbox_overlaps(
    np.ascontiguousarray(anchors, dtype=np.float64),
    np.ascontiguousarray(gt_boxes, dtype=np.float64))
argmax_overlaps = overlaps.argmax(axis=1) # for each anchor, index of the GT box it overlaps most
max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps] # for each anchor, its overlap with that GT box
gt_argmax_overlaps = overlaps.argmax(axis=0) # for each GT box, the anchor that overlaps it most
gt_max_overlaps = overlaps[gt_argmax_overlaps,
                                   np.arange(overlaps.shape[1])] # for each GT box, the value of that largest overlap
gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

# anchors whose overlap is below the threshold 0.3 are labeled 0
if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:
    # assign bg labels first so that positive labels can clobber them
    labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

# fg label: for each gt, anchor with highest overlap; the anchor that overlaps a GT box most is labeled 1
labels[gt_argmax_overlaps] = 1

# fg label: above threshold IOU; anchors whose overlap with a GT box exceeds the threshold 0.7 are also labeled 1
labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1

if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
    # assign bg labels last so that negative labels can clobber positives
    labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0
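
For readers who want to see the overlap measure itself, here is a minimal NumPy sketch of an IoU matrix together with the labeling rule above. `iou_matrix` and `label_anchors` are hypothetical names; the thresholds 0.3 and 0.7 are the ones mentioned in this post, boxes use inclusive pixel coordinates, and negatives are assigned first as in the branch where RPN_CLOBBER_POSITIVES is off.

```python
import numpy as np

def iou_matrix(boxes, gt_boxes):
    """IoU between every box (N, 4) and every GT box (K, 4); returns (N, K)."""
    b = boxes[:, None, :]
    g = gt_boxes[None, :, :]
    iw = np.minimum(b[..., 2], g[..., 2]) - np.maximum(b[..., 0], g[..., 0]) + 1
    ih = np.minimum(b[..., 3], g[..., 3]) - np.maximum(b[..., 1], g[..., 1]) + 1
    inter = np.clip(iw, 0, None) * np.clip(ih, 0, None)
    area_b = (b[..., 2] - b[..., 0] + 1) * (b[..., 3] - b[..., 1] + 1)
    area_g = (g[..., 2] - g[..., 0] + 1) * (g[..., 3] - g[..., 1] + 1)
    return inter / (area_b + area_g - inter)

def label_anchors(overlaps, neg_thresh=0.3, pos_thresh=0.7):
    """1 = foreground, 0 = background, -1 = ignored."""
    labels = np.full(overlaps.shape[0], -1.0, dtype=np.float32)
    max_overlaps = overlaps.max(axis=1)
    labels[max_overlaps < neg_thresh] = 0      # background first
    labels[overlaps.argmax(axis=0)] = 1        # best anchor for each GT box
    labels[max_overlaps >= pos_thresh] = 1     # anchors above the positive threshold
    return labels
```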

The paper says that 256 anchors are randomly selected from all anchors, 128 foreground and 128 background. Note that anchors with label -1 count as neither foreground nor background. Of the two code fragments below, the first selects 128 of all foreground anchors and the second selects 128 of all background anchors. If there are fewer than 128 foreground anchors, all of them are used and the shortfall is filled with background anchors. This is the same as the ROI sampling in Fast RCNN.

# subsample positive labels if we have too many
# select 128 of the anchors with label 1 and set the label of the remaining ones to -1
#cfg.TRAIN.RPN_FG_FRACTION=0.5 , cfg.TRAIN.RPN_BATCHSIZE=256
num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE) # sampling limit, 128
fg_inds = np.where(labels == 1)[0]
if len(fg_inds) > num_fg:
    disable_inds = npr.choice(
        fg_inds, size=(len(fg_inds) - num_fg), replace=False)
    labels[disable_inds] = -1
# subsample negative labels if we have too many
# num_bg is not fixed at 128; it is 256 minus the number of anchors with label 1, so that if there are too few positives the batch is filled with label-0 anchors, which is a neat trick
num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
bg_inds = np.where(labels == 0)[0]
if len(bg_inds) > num_bg:
    disable_inds = npr.choice(
        bg_inds, size=(len(bg_inds) - num_bg), replace=False)
    labels[disable_inds] = -1


bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32) # bbox targets for the anchors kept after filtering
bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :]) # regression residuals between each anchor and its GT box; the output is (targets_dx, targets_dy, targets_dw, targets_dh)

bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
#_C.TRAIN.RPN_BBOX_INSIDE_WEIGHTS = (1.0, 1.0, 1.0, 1.0)
bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
# normalize the sample weights
# Give the positive RPN examples weight of p * 1 / {num positives}
# and give negatives a weight of (1 - p)
# Set to -1.0 to use uniform example weighting
# __C.TRAIN.RPN_POSITIVE_WEIGHT = -1.0
if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
    # uniform weighting of examples (given non-uniform sampling)
    num_examples = np.sum(labels >= 0)
    positive_weights = np.ones((1, 4)) * 1.0 / num_examples
    negative_weights = np.ones((1, 4)) * 1.0 / num_examples
else:
    assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
            (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
    positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT /
                        np.sum(labels == 1))
    negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) /
                                np.sum(labels == 0))
bbox_outside_weights[labels == 1, :] = positive_weights
bbox_outside_weights[labels == 0, :] = negative_weights

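After that, the values computed for the inside anchors are mapped back to the full set of anchors:

# map up to original set of anchors
# maps data of length len(inds_inside) back to data of length total_anchors, where total_anchors=(width*height)*9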
labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)
bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)
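
The `_unmap` helper is not shown in this post. A minimal sketch of what such a helper does, under the assumption that it scatters the values computed for the inside anchors back into an array covering all anchors, could look like this (`unmap` is a stand-in name):

```python
import numpy as np

def unmap(data, count, inds, fill=0):
    """Scatter `data` (one row per kept anchor) into an array with `count` rows.

    Rows listed in `inds` receive the values from `data`; all others get `fill`.
    """
    out = np.full((count,) + data.shape[1:], fill, dtype=np.float32)
    out[inds] = data
    return out
```

Anchors outside the image thus end up with label -1 and zero regression targets and weights, so they are ignored by both losses.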


Finally, the arrays are reshaped and assigned to the 4 outputs of this layer:

# labels
labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
labels = labels.reshape((1, 1, A * height, width))
top[0].reshape(*labels.shape)
top[0].data[...] = labels

# bbox_targets
bbox_targets = bbox_targets \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
top[1].reshape(*bbox_targets.shape)
top[1].data[...] = bbox_targets

# bbox_inside_weights
bbox_inside_weights = bbox_inside_weights \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
assert bbox_inside_weights.shape[2] == height
assert bbox_inside_weights.shape[3] == width
top[2].reshape(*bbox_inside_weights.shape)
top[2].data[...] = bbox_inside_weights

# bbox_outside_weights
bbox_outside_weights = bbox_outside_weights \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
assert bbox_outside_weights.shape[2] == height
assert bbox_outside_weights.shape[3] == width
top[3].reshape(*bbox_outside_weights.shape)
top[3].data[...] = bbox_outside_weights

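This completes the walkthrough of how anchor labels and box targets are generated from the feature map and the anchors; the next step is to compute the RPN loss from this layer's outputs. PS: note that this layer does not implement backpropagation. Why? It provides no gradient to the network. The reason is that the layer's input rpn_cls_score only supplies its height and width and is not used otherwise, so there is no gradient to pass on.
(4) proposal_target_layer.py
Generates training targets or labels for every object proposal. The class labels range from 0 to K (0 for background, or an object class 1, ..., K); box regression targets are assigned only when the label is greater than 0.
This layer matches the boxes predicted by the RPN to their classes. The number of boxes used per training step is limited (only 32 foreground boxes each time, 1/4 of the total); see the _sample_rois function. First, the number of classes is read and the shapes of the output blobs are initialized: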

def setup(self, bottom, top):
    layer_params = yaml.load(self.param_str_)
    self._num_classes = layer_params['num_classes']

    # sampled rois (0, x1, y1, x2, y2)
    top[0].reshape(1, 5)
    # labels
    top[1].reshape(1, 1)
    # bbox_targets
    top[2].reshape(1, self._num_classes * 4)
    # bbox_inside_weights
    top[3].reshape(1, self._num_classes * 4)
    # bbox_outside_weights
    top[4].reshape(1, self._num_classes * 4)

Forward function

def forward(self, bottom, top):
    # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN
    # (i.e., rpn.proposal_layer.ProposalLayer), or any other source
    all_rois = bottom[0].data # boxes predicted by the RPN, shape [N,5]
    # GT boxes (x1, y1, x2, y2, label)
    # TODO(rbg): it's annoying that sometimes I have extra info before
    # and other times after box coordinates -- normalize to one format
    gt_boxes = bottom[1].data # GT information, shape [M,5]

    # Include ground-truth boxes in the set of candidate rois
    # add the ground truth boxes to the boxes to be classified (this increases the number of positive samples)
    # all_rois then has shape [N+M,5]: the boxes selected from the RPN output and the ground truth boxes are stacked together
    zeros = np.zeros((gt_boxes.shape[0], 1), dtype=gt_boxes.dtype)
    all_rois = np.vstack(
        (all_rois, np.hstack((zeros, gt_boxes[:, :-1])))
    ) # first prepend a 0 to every ground truth box (so that it lines up with the N boxes from the RPN output), then append the ground truth boxes at the end

    # Sanity check: single batch only
    assert np.all(all_rois[:, 0] == 0), \
        'Only single item batches are supported'

    num_images = 1
    rois_per_image = cfg.TRAIN.BATCH_SIZE / num_images # cfg.TRAIN.BATCH_SIZE is 128
    # cfg.TRAIN.FG_FRACTION is 0.25, i.e. at most 32 foreground boxes per classification training step
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)

    # Sample rois with classification labels and bounding box regression
    # targets
    # _sample_rois selects the boxes used for classification training and computes their class and coordinate ground truth, plus the bbox_inside_weights needed for the box regression loss
    labels, rois, bbox_targets, bbox_inside_weights = _sample_rois(
        all_rois, gt_boxes, fg_rois_per_image,
        rois_per_image, self._num_classes)

    if DEBUG:
        print('num fg: {}'.format((labels > 0).sum()))
        print('num bg: {}'.format((labels == 0).sum()))
        self._count += 1
        self._fg_num += (labels > 0).sum()
        self._bg_num += (labels == 0).sum()
        print('num fg avg: {}'.format(self._fg_num / self._count))
        print('num bg avg: {}'.format(self._bg_num / self._count))
        print('ratio: {:.3f}'.format(float(self._fg_num) / float(self._bg_num)))

    # sampled rois: all boxes kept after sampling
    top[0].reshape(*rois.shape)
    top[0].data[...] = rois

    # classification labels: class of each box
    top[1].reshape(*labels.shape)
    top[1].data[...] = labels

    # bbox_targets: residuals between the boxes and their GT boxes
    top[2].reshape(*bbox_targets.shape)
    top[2].data[...] = bbox_targets

    # bbox_inside_weights
    top[3].reshape(*bbox_inside_weights.shape)
    top[3].data[...] = bbox_inside_weights

    # bbox_outside_weights
    top[4].reshape(*bbox_inside_weights.shape)
    top[4].data[...] = np.array(bbox_inside_weights > 0).astype(np.float32)

对预测框进行采样并计算残差，在GT上找到其对应的分类

def _sample_rois(all_rois, gt_boxes, fg_rois_per_image, rois_per_image, num_classes):
    """Generate a random sample of RoIs comprising foreground and background
    examples.
    """
    # overlaps: (rois x gt_boxes)
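    # Compute the overlap (IoU) between every RoI and every ground-truth box.
    # Only coordinates are used: columns 1-4 of each RoI (column 0 is the prepended batch index)
    # and columns 0-3 of each ground-truth box.
    overlaps = bbox_overlaps(
        np.ascontiguousarray(all_rois[:, 1:5], dtype=np.float64),  # np.float64, since the np.float alias was removed in NumPy 1.24
        np.ascontiguousarray(gt_boxes[:, :4], dtype=np.float64))
    gt_assignment = overlaps.argmax(axis=1)  # for each RoI, index of the gt box with the largest overlap; shape: [len(all_rois),]
    max_overlaps = overlaps.max(axis=1)  # for each RoI, that largest overlap value; shape: [len(all_rois),]
    labels = gt_boxes[gt_assignment, 4]  # for each RoI, the class of its assigned gt box; shape: [len(all_rois),]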

    # Select foreground RoIs as those with >= FG_THRESH overlap
    # Indices of the RoIs whose best overlap reaches the foreground threshold
    fg_inds = np.where(max_overlaps >= cfg.TRAIN.FG_THRESH)[0]
    # Guard against the case when an image has fewer than fg_rois_per_image
    # foreground RoIs: number of foreground RoIs actually used
    fg_rois_per_this_image = int(min(fg_rois_per_image, fg_inds.size))
    # Sample foreground regions without replacement
    # (randomly drops the surplus when there are more foreground RoIs than needed)
    if fg_inds.size > 0:
        fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)

    # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
    # RoIs treated as background: best overlap with any gt box between 0 and 0.5
    bg_inds = np.where((max_overlaps < cfg.TRAIN.BG_THRESH_HI) &
                       (max_overlaps >= cfg.TRAIN.BG_THRESH_LO))[0]
    # Compute number of background RoIs to take from this image (guarding
    # against there being fewer than desired)
    bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image  # 128 - 32 = 96 when the foreground quota is filled
    bg_rois_per_this_image = int(min(bg_rois_per_this_image, bg_inds.size))  # same steps as for the foreground
    # Sample background regions without replacement
    if bg_inds.size > 0:
        bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)

    # The indices that we're selecting (both fg and bg)
    keep_inds = np.append(fg_inds, bg_inds)  # indices of all RoIs kept after sampling
    # Select sampled values from various arrays:
    labels = labels[keep_inds]  # labels of the kept RoIs
    # Clamp labels for the background RoIs to 0
    labels[fg_rois_per_this_image:] = 0  # background RoIs get class 0
    rois = all_rois[keep_inds]  # the kept RoIs themselves

    # For the kept RoIs, gather the class ground truth and the coordinate-transform ground truth (the residual of each predicted box)
    bbox_target_data = _compute_targets(
        rois[:, 1:5], gt_boxes[gt_assignment[keep_inds], :4], labels)

    # Build the ground-truth box-regression values and the bbox_inside_weights that are finally used when computing the loss
    bbox_targets, bbox_inside_weights = \
        _get_bbox_regression_labels(bbox_target_data, num_classes)

    return labels, rois, bbox_targets, bbox_inside_weights
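
The sampling logic above can be tried out without Caffe or the compiled bbox_overlaps helper. What follows is a minimal NumPy-only sketch under stated assumptions: `iou_matrix` and `sample_indices` are hypothetical stand-ins written for illustration, the thresholds are passed as arguments instead of being read from cfg, boxes use inclusive pixel coordinates (hence the +1), and the four toy boxes are made up.

```python
import numpy as np

def iou_matrix(boxes, gt_boxes):
    """Pairwise IoU for (x1, y1, x2, y2) boxes with inclusive pixel coordinates."""
    boxes = np.asarray(boxes, dtype=np.float64)
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64)
    area_b = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0] + 1) * (gt_boxes[:, 3] - gt_boxes[:, 1] + 1)
    # Intersection width/height via broadcasting; negative values mean no overlap.
    iw = (np.minimum(boxes[:, None, 2], gt_boxes[None, :, 2])
          - np.maximum(boxes[:, None, 0], gt_boxes[None, :, 0]) + 1).clip(min=0)
    ih = (np.minimum(boxes[:, None, 3], gt_boxes[None, :, 3])
          - np.maximum(boxes[:, None, 1], gt_boxes[None, :, 1]) + 1).clip(min=0)
    inter = iw * ih
    return inter / (area_b[:, None] + area_g[None, :] - inter)

def sample_indices(max_overlaps, fg_quota, total, rng,
                   fg_thresh=0.5, bg_hi=0.5, bg_lo=0.0):
    """Pick up to fg_quota foreground RoIs, then fill up to `total` with background."""
    fg = np.where(max_overlaps >= fg_thresh)[0]
    n_fg = int(min(fg_quota, fg.size))
    if fg.size > 0:
        fg = rng.choice(fg, size=n_fg, replace=False)
    bg = np.where((max_overlaps < bg_hi) & (max_overlaps >= bg_lo))[0]
    n_bg = int(min(total - n_fg, bg.size))
    if bg.size > 0:
        bg = rng.choice(bg, size=n_bg, replace=False)
    return np.append(fg, bg), n_fg

gt = [[0, 0, 9, 9]]
rois = [[0, 0, 9, 9],      # identical to the gt box: foreground
        [0, 0, 9, 4],      # upper half of the gt box: foreground at the threshold
        [20, 20, 29, 29],  # disjoint from the gt box: background
        [5, 5, 14, 14]]    # small corner overlap: background
max_overlaps = iou_matrix(rois, gt).max(axis=1)
keep, n_fg = sample_indices(max_overlaps, fg_quota=1, total=3,
                            rng=np.random.default_rng(0))
print(keep, n_fg)
```

Because the foreground indices come first in the returned array, everything from position `n_fg` onward is background, which is what allows the original code to zero the background labels with a single slice.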

Computing the residuals of the predicted boxes:

def _compute_targets(ex_rois, gt_rois, labels):
    """Compute bounding-box regression targets for an image."""

    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 4

    targets = bbox_transform(ex_rois, gt_rois)  # residuals between the predicted boxes and their gt boxes
    if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:  # normalize only if configured
        # Optionally normalize targets by a precomputed mean and stdev
        targets = ((targets - np.array(cfg.TRAIN.BBOX_NORMALIZE_MEANS))
                / np.array(cfg.TRAIN.BBOX_NORMALIZE_STDS))
    # Stack the labels in front of the residuals column-wise: each row becomes (class, tx, ty, tw, th)
    return np.hstack(
            (labels[:, np.newaxis], targets)).astype(np.float32, copy=False)
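
The body of bbox_transform is not shown in this walkthrough. As a rough illustration of what such residuals look like, here is a sketch that assumes the usual R-CNN parameterization: center offsets scaled by the RoI size, log ratios for width and height, and inclusive pixel coordinates. The names `bbox_transform_sketch` and `compute_targets_sketch`, the example boxes, and the mean/std values in the usage lines are assumptions for illustration only.

```python
import numpy as np

def bbox_transform_sketch(ex_rois, gt_rois):
    """Residuals (dx, dy, dw, dh) that map each RoI onto its gt box."""
    ex = np.asarray(ex_rois, dtype=np.float64)
    gt = np.asarray(gt_rois, dtype=np.float64)
    ex_w = ex[:, 2] - ex[:, 0] + 1.0
    ex_h = ex[:, 3] - ex[:, 1] + 1.0
    ex_cx = ex[:, 0] + 0.5 * ex_w
    ex_cy = ex[:, 1] + 0.5 * ex_h
    gt_w = gt[:, 2] - gt[:, 0] + 1.0
    gt_h = gt[:, 3] - gt[:, 1] + 1.0
    gt_cx = gt[:, 0] + 0.5 * gt_w
    gt_cy = gt[:, 1] + 0.5 * gt_h
    dx = (gt_cx - ex_cx) / ex_w      # center shift, in units of RoI width
    dy = (gt_cy - ex_cy) / ex_h      # center shift, in units of RoI height
    dw = np.log(gt_w / ex_w)         # log scale change in width
    dh = np.log(gt_h / ex_h)         # log scale change in height
    return np.stack((dx, dy, dw, dh), axis=1)

def compute_targets_sketch(ex_rois, gt_rois, labels, means=None, stds=None):
    """Compact N x 5 rows: (class, dx, dy, dw, dh), optionally normalized."""
    targets = bbox_transform_sketch(ex_rois, gt_rois)
    if means is not None and stds is not None:
        targets = (targets - np.asarray(means)) / np.asarray(stds)
    labels = np.asarray(labels, dtype=np.float64)
    return np.hstack((labels[:, np.newaxis], targets)).astype(np.float32)

# A 10x10 RoI whose gt box is the same size, shifted 5 pixels to the right.
raw = compute_targets_sketch([[0, 0, 9, 9]], [[5, 0, 14, 9]], labels=[3])
normalized = compute_targets_sketch([[0, 0, 9, 9]], [[5, 0, 14, 9]], labels=[3],
                                    means=(0.0, 0.0, 0.0, 0.0),
                                    stds=(0.1, 0.1, 0.2, 0.2))
print(raw, normalized)
```

A shift of half the RoI width gives dx = 0.5 before normalization. Dividing by a small standard deviation enlarges the target, which keeps the regression values in a range comparable to the other loss terms.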

Rearranging the data into the required format:

def _get_bbox_regression_labels(bbox_target_data, num_classes):
    """Bounding-box regression targets (bbox_target_data) are stored in a
    compact form N x (class, tx, ty, tw, th)

    This function expands those targets into the 4-of-4*K representation used
    by the network (i.e. only one class has non-zero targets).

    Returns:
        bbox_target (ndarray): N x 4K blob of regression targets
        bbox_inside_weights (ndarray): N x 4K blob of loss weights
    """

    clss = bbox_target_data[:, 0]  # class assigned to each RoI from its overlap with the gt boxes
    # Regression targets, stored only in the 4 columns of the assigned class
    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
    # Initialize bbox_inside_weights to all zeros
    bbox_inside_weights = np.zeros(bbox_targets.shape, dtype=np.float32)
    inds = np.where(clss > 0)[0]  # non-background RoIs
    for ind in inds:
        cls = clss[ind]
        start = int(4 * cls)  # first column of this class's 4 regression values (int() because cls is stored as float32)
        end = start + 4  # one past the last column of this class's regression values
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]  # write the residuals into the columns of the assigned class
        # Set bbox_inside_weights to 1 in the same columns
        bbox_inside_weights[ind, start:end] = cfg.TRAIN.BBOX_INSIDE_WEIGHTS  # (1.0, 1.0, 1.0, 1.0)
    return bbox_targets, bbox_inside_weights
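
To make the "4-of-4*K" layout concrete, here is a small standalone sketch of the same expansion. It mirrors the loop above, with the inside weights passed as an argument instead of read from cfg. The function name `expand_targets_sketch` and the two toy rows are assumptions for illustration.

```python
import numpy as np

def expand_targets_sketch(bbox_target_data, num_classes,
                          inside_weights=(1.0, 1.0, 1.0, 1.0)):
    """Expand compact N x (class, tx, ty, tw, th) rows into N x 4K arrays."""
    data = np.asarray(bbox_target_data, dtype=np.float32)
    bbox_targets = np.zeros((data.shape[0], 4 * num_classes), dtype=np.float32)
    bbox_inside_weights = np.zeros_like(bbox_targets)
    classes = data[:, 0].astype(np.int64)  # integer class ids for slicing
    for ind in np.where(classes > 0)[0]:   # skip background (class 0)
        start = 4 * classes[ind]
        bbox_targets[ind, start:start + 4] = data[ind, 1:]
        bbox_inside_weights[ind, start:start + 4] = inside_weights
    return bbox_targets, bbox_inside_weights

compact = [[2, 0.1, 0.2, 0.3, 0.4],   # foreground RoI of class 2
           [0, 9.0, 9.0, 9.0, 9.0]]   # background RoI: its targets are ignored
targets, weights = expand_targets_sketch(compact, num_classes=3)
print(targets)
print(weights)
```

With 3 classes each row has 12 columns. The class-2 row carries its four residuals in columns 8 to 11, and the background row stays entirely zero in both arrays.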

(5) generate.py
Generate object detection proposals from an imdb using an RPN.
We now have a rough picture of how the RPN is structured and of the files in the RPN module. Next, we read through the implementation to see what it actually does.