Object Detection with Faster R-CNN (Part 2): A ResNet-Based Faster R-CNN Network Model


WYXHAHAHA123


Article link: https://blog.csdn.net/WYXHAHAHA123/article/details/86251919


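First, a note on something I heard from an expert today: the computer vision tasks we work on can be divided into three categories. The first category reduces dimensionality, for example mapping the input image to a class-probability vector (num_classes*1); this is classification. In the second category the output has the same resolution as the input image, for example image denoising or style transfer. In the third category the output has a higher resolution than the input, for example super resolution. Datasets for the first category usually ship with ground truth annotations, whereas for the second and third categories there is often no ground truth and the output image only needs to look plausible. The second and third categories typically rely on deep generative models. In this speaker's view, a GAN is not itself a kind of deep generative model but a form of loss function: people found that training with a GAN-style loss helps the generated images retain more detail, which is why GANs are so widely used with deep generative models.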

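As mentioned in the previous post, after all images in the training set are transformed, the transformed images are not guaranteed to share one resolution, but all images within the same batch do have exactly the same spatial resolution.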

faster-rcnn.pytorch/lib/model/utils/config.py stores the hyperparameters of the training process, such as the training batch size and the NMS thresholds.

faster-rcnn-pytorch-master\cfgs\res101.yml stores hyperparameters of the network model itself, such as the ROI pooling size.

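First, note that this code trains Faster R-CNN end to end. The detector's base network (also called the backbone or feature extraction network) is usually a strong classification network reused as the foundation for more complex tasks such as detection and segmentation. Here it is set to resnet101. A typical classification network has output stride=32 (output stride is defined in the DeepLab v3 paper as the factor by which the output feature map resolution is reduced relative to the input image; for a classification network it usually refers to the feature map right before global average pooling).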

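For the RPN training process and how region proposals are generated for the downstream Fast R-CNN, http://www.telesens.co/2018/03/11/object-detection-and-classification-using-r-cnns/ gives an excellent explanation and is highly recommended.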

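Note that the classifier and regression heads of both the RPN and Fast R-CNN consist only of a 1*1 convolution (RPN) or an nn.Linear fully connected layer (Fast R-CNN), with no activation function applied afterwards. Looking at bbox_target_data, i.e. the regression offsets that Fast R-CNN is expected to predict, it contains many values such as -12 and 14, so the regression output cannot be passed through ReLU or sigmoid. For the classifier, the loss is computed with F.cross_entropy, whose implementation already combines a (log-)softmax with nn.NLLLoss, so the classifier output needs no activation either.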
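To make the point about F.cross_entropy concrete, here is a minimal pure-Python sketch of what a cross entropy computed from raw scores does: a log-softmax followed by a negative log-likelihood. The helper names log_softmax and cross_entropy are made up for this illustration; this is not PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class, computed from raw logits."""
    return -log_softmax(logits)[target]

# Raw classifier outputs with no activation applied; negative values are fine.
raw_scores = [2.0, -1.0, 0.5]
loss = cross_entropy(raw_scores, target=0)
```

Because the normalization happens inside the loss, the classifier head can emit unbounded scores, which is why no softmax layer appears in the network definition.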

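Loading one batch of training data per step requires that, for every input image regardless of file name, each item returned by __getitem__() has the same shape; only then can the batch be forwarded together. One thing to watch: the original code sets __C.MAX_NUM_GT_BOXES = 20, i.e. at most 20 ground truth boxes per image during training, whereas in my dataset ('max num_objs', 91, 'min num_objs', 1), so a lot of information was cut off when loading the gt boxes. I therefore changed the maximum number of boxes to 100 in config.py.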
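The fixed-shape requirement can be sketched as follows. This is an illustration under my own assumptions, not the repository's loader: pad_gt_boxes is a hypothetical helper, and each box is taken to be [xmin, ymin, xmax, ymax, class].

```python
def pad_gt_boxes(gt_boxes, max_num_gt):
    """Return (padded, num_boxes) where padded always has max_num_gt rows.

    Boxes beyond max_num_gt are dropped, which is exactly how annotations
    get lost when the limit is smaller than the number of objects.
    """
    kept = [list(box) for box in gt_boxes[:max_num_gt]]
    num_boxes = len(kept)
    padding = [[0.0] * 5 for _ in range(max_num_gt - num_boxes)]
    return kept + padding, num_boxes

# Two annotated objects padded up to a fixed size of 4 rows.
boxes = [[10.0, 20.0, 50.0, 80.0, 1.0], [30.0, 40.0, 90.0, 120.0, 3.0]]
padded, num_boxes = pad_gt_boxes(boxes, max_num_gt=4)
```

With a limit of 20 and an image containing 91 objects, 71 boxes would be silently discarded, which is the reason for raising the limit to 100.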

Output of one iteration step
Variable	Type	Shape
data	torch.tensor	[batch_size,3,H,W], mean-subtracted only
im_info	torch.tensor	[batch_size,3]
gt_boxes	torch.tensor	[batch_size,20,5]
num_boxes	torch.tensor	[batch_size,]

fasterRCNN = resnet(imdb.classes, 101, pretrained=True, class_agnostic=args.class_agnostic)

The fasterRCNN defined in the main function is an instance of the resnet class, and class resnet(_fasterRCNN): makes resnet a subclass of _fasterRCNN.

class _fasterRCNN(nn.Module):
    """ faster RCNN """
    def __init__(self, classes, class_agnostic):
        super(_fasterRCNN, self).__init__()
        self.classes = classes
        self.n_classes = len(classes)
        self.class_agnostic = class_agnostic
        # loss
        self.RCNN_loss_cls = 0
        self.RCNN_loss_bbox = 0
        '''
        imdb.classes    a tuple
        ('__background__', 'specularity', 'saturation', 'artifact', 'blur', 'contrast', 'bubbles', 'instrument')
        self.class_agnostic   False
        '''
    def forward(self, im_data, im_info, gt_boxes, num_boxes):

The rest of this post walks through the forward method of the _fasterRCNN class.

I. base_feat = self.RCNN_base(im_data)

self.RCNN_base = nn.Sequential(resnet.conv1, resnet.bn1,resnet.relu,
  resnet.maxpool,resnet.layer1,resnet.layer2,resnet.layer3)
# feed image data to base model to obtain base feature map

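The input image is fed into resnet101 and features are extracted from the input up to the third residual stage (layer3). At that point the feature map has output stride=16, i.e. the input image resolution has been reduced by a factor of 16.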

II. rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes): the full RPN forward pass

# feed base feature map to RPN to obtain rois
rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
self.RCNN_rpn = _RPN(self.dout_base_model)
#self.dout_base_model=1024, the number of channels of the base feature map

1. On the base feature map, predict the class and the position offset of each anchor

batch_size = base_feat.size(0)

 # return feature map after convrelu layer
 rpn_conv1 = F.relu(self.RPN_Conv(base_feat), inplace=True)
 #output_stride=16,output_channel=512 (the 3*3 RPN_Conv reduces the 1024 input channels to 512)
 # get rpn classification score
 rpn_cls_score = self.RPN_cls_score(rpn_conv1)
 #output_stride=16,output_channels=2*9=18

 rpn_cls_score_reshape = self.reshape(rpn_cls_score, 2)#[batch_size,2,9*(H/16),(W/16)]
 rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape, 1)
 rpn_cls_prob = self.reshape(rpn_cls_prob_reshape, self.nc_score_out)#[batch_size,18,(H/16),(W/16)]

 '''
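base_feat
is the output of passing the input image through the RCNN_base network (from the input up to the output of resnet layer3, the third residual stage)
spatial resolution: output stride=16, 1024 feature channels
a 3*3 convolution is applied first to get the feature map shared by RPN classification and box regression    output stride=16, output channels=512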
'''
 # get rpn offsets to the anchor boxes
 rpn_bbox_pred = self.RPN_bbox_pred(rpn_conv1)
 '''
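Both the foreground/background class prediction and the prediction of the 4 coordinate values
use a 1*1 convolution, producing a class score map with 18 channels and a box regression map with 36 channels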
 '''

Both the anchor class prediction and the position offset prediction go through a convolution only, with no activation function afterwards.

2.rois = self.RPN_proposal((rpn_cls_prob.data, rpn_bbox_pred.data, im_info, cfg_key))

This covers generate anchor and the proposal layer.

self.RPN_proposal = _ProposalLayer(self.feat_stride, self.anchor_scales, self.anchor_ratios)

self.feat_stride=cfg.FEAT_STRIDE[0]#__C.FEAT_STRIDE = [16, ]
self.anchor_scales = cfg.ANCHOR_SCALES#[8,16,32]
self.anchor_ratios = cfg.ANCHOR_RATIOS#[0.5,1,2]

(1) generate_anchors(scales=np.array(scales), ratios=np.array(ratios))

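This step does not generate the 3*3=9 anchor box coordinates at every pixel of the feature map. It generates the 9 anchors only at position (0,0) of the feature map. The coordinates are expressed at the resolution of the image actually used for training, i.e. the original dataset image after rescaling and padding.

The feature map on which the 1*1 convolutions predict foreground/background scores and position offsets has output stride=16,
so the pixel at position (0,0) of the feature map corresponds to the window
(0, 0, 15, 15)
on the input image.
The RPN input feature map has output stride=16, with 3 aspect ratios and 3 scales [8,16,32],
giving 9 anchors at every pixel of the feature map.
The return value holds the anchor coordinates in [xmin, ymin, xmax, ymax] format, at the resolution of the input image
(after rescaling and padding).
About scales:
for a feature map with output stride=16, the base anchor area is 16*16=256.
The scale values 8, 16 and 32 multiply both the width and the height of the base anchor,
so the generated anchor boxes have areas of (8*8)*256, (16*16)*256 and (32*32)*256.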
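The base-anchor computation described above can be sketched in pure Python. This is my own reimplementation for illustration: the function name generate_base_anchors is hypothetical, and the rounding steps are an assumption modelled on the common py-faster-rcnn style routine rather than a copy of generate_anchors.py.

```python
import math

def generate_base_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Anchors as [xmin, ymin, xmax, ymax] around the window (0, 0, 15, 15)."""
    ctr = 0.5 * (base_size - 1)        # center of the reference window: 7.5
    area = base_size * base_size       # base anchor area: 256
    anchors = []
    for ratio in ratios:               # ratio = height / width
        w = round(math.sqrt(area / ratio))
        h = round(w * ratio)
        for scale in scales:           # scale multiplies both width and height
            ws, hs = w * scale, h * scale
            anchors.append([ctr - 0.5 * (ws - 1), ctr - 0.5 * (hs - 1),
                            ctr + 0.5 * (ws - 1), ctr + 0.5 * (hs - 1)])
    return anchors

base_anchors = generate_base_anchors()
```

For ratio 1 and scale 8 this gives a 128*128 box, whose area is exactly (8*8)*256. For the other ratios the areas are only approximately equal to that value because the widths and heights are rounded to integers.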

(2) Decode the anchor position offsets predicted by the RPN

The tensor.new() method (a method of torch.Tensor) creates a tensor with the same data type as the current tensor; the shape may differ.

https://stackoverflow.com/questions/49263588/pytorch-beginner-tensor-new-method

The code below is taken from the forward() method in \faster-rcnn-pytorch-master\lib\model\rpn\proposal_layer.py

Inputs to the forward method:

rois = self.RPN_proposal((rpn_cls_prob.data, rpn_bbox_pred.data,
                         im_info, cfg_key))
rpn_cls_prob    torch.tensor shape   [batch_size,18,(H/16),(W/16)]
rpn_bbox_pred   torch.tensor shape   [batch_size,36,(H/16),(W/16)]
im_info         torch.tensor shape   [batch_size,3]
cfg_key = 'TRAIN'

scores = input[0][:, self._num_anchors:, :, :]
 '''
rpn_cls_prob    torch.tensor shape   [batch_size,18,(H/16),(W/16)]
dimension 1 has 18 channels: channels 0~8 hold the probability that each of the 9 anchor boxes at this pixel is background
channels 9~17 hold the probability that each of the 9 anchor boxes at this pixel is foreground
'''
 bbox_deltas = input[1]
 im_info = input[2]
 cfg_key = input[3]

 pre_nms_topN  = cfg[cfg_key].RPN_PRE_NMS_TOP_N
 '''
 # Number of top scoring boxes to keep before apply NMS to RPN proposals
 __C.TRAIN.RPN_PRE_NMS_TOP_N = 12000
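 Based on scores (the foreground probability of every anchor at every pixel of the feature map), keep the
 top 12000 anchors before running NMS (using, of course, the coordinates already adjusted by the RPN offsets)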
 '''
 post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N
 '''
 # Number of top scoring boxes to keep after applying NMS to RPN proposals
 __C.TRAIN.RPN_POST_NMS_TOP_N = 2000
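 After this pre-NMS filtering, the position-adjusted anchor boxes go through NMS with a threshold of 0.7; from
 the boxes that survive, the top 2000 by score are selected as region proposals
 and passed on to the Fast R-CNN network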
'''
 nms_thresh    = cfg[cfg_key].RPN_NMS_THRESH
 '''__C.TRAIN.RPN_NMS_THRESH = 0.7'''
 min_size      = cfg[cfg_key].RPN_MIN_SIZE
 '''
 # Proposal height and width both need to be greater than RPN_MIN_SIZE (at orig image scale)
 __C.TRAIN.RPN_MIN_SIZE = 8
 At the input image scale, a region proposal produced by the RPN should be larger than 8*8
 This variable has no effect in this code
 '''
 batch_size = bbox_deltas.size(0)#note again: the RPN box head predicts offsets relative to the anchors

 feat_height, feat_width = scores.size(2), scores.size(3)
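 #feature map resolution, equal to 1/16 of the resolution of the training input image
 #the training input image is the original dataset image after rescaling and padding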
 shift_x = np.arange(0, feat_width) * self._feat_stride#shape   [feat_width]
 shift_y = np.arange(0, feat_height) * self._feat_stride#shape  [feat_height]
 shift_x, shift_y = np.meshgrid(shift_x, shift_y)
 shifts = torch.from_numpy(np.vstack((shift_x.ravel(), shift_y.ravel(),
                           shift_x.ravel(), shift_y.ravel())).transpose())
 shifts = shifts.contiguous().type_as(scores).float()
 #shifts    shape    [(H/16)*(W/16),4]
 #H and W are the height and width of the image actually fed to the network for training,
 #i.e. the original dataset image after rescaling and padding

 A = self._num_anchors#9
 K = shifts.size(0)#number of feature map locations (feature map height*width)

 self._anchors = self._anchors.type_as(scores)
 # anchors = self._anchors.view(1, A, 4) + shifts.view(1, K, 4).permute(1, 0, 2).contiguous()
 anchors = self._anchors.view(1, A, 4) + shifts.view(K, 1, 4)#output shape   [K,A,4]
 anchors = anchors.view(1, K * A, 4).expand(batch_size, K * A, 4)#output shape [batch_size, K * A, 4]
 '''
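 See PyTorch's broadcasting semantics for the details of this operation
 anchors now has shape   [batch_size, K * A, 4]    K*A is the total number of anchor boxes in one input image
 coordinates are [xmin,ymin,xmax,ymax] at the resolution of the training input image
 Summary: the way the anchor box coordinates are obtained (in code) is quite neat
 The 9 anchor boxes are not generated separately for every pixel of the feature map
 Instead, the 9 anchor boxes are first generated at position (0,0) of the feature map as relative coordinates (relative
 to the (0,0,15,15) grid cell), and then, in a sliding-window fashion,
 the shift of every feature map position is added to these 9 base anchor boxes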
 '''

 # Transpose and reshape predicted bbox transformations to get them
 # into the same order as the anchors:

 bbox_deltas = bbox_deltas.permute(0, 2, 3, 1).contiguous()#(batch_size,(H/16),(W/16),36)
 bbox_deltas = bbox_deltas.view(batch_size, -1, 4)#(batch_size,(H/16)*(W/16)*9, 4)

 # Same story for the scores:
 scores = scores.permute(0, 2, 3, 1).contiguous()#(batch_size,(H/16),(W/16),9)
 scores = scores.view(batch_size, -1)#(batch_size,(H/16)*(W/16)*9)

 # Convert anchors into proposals via bbox transformations
 proposals = bbox_transform_inv(anchors, bbox_deltas, batch_size)
 '''
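 Using the anchors and the position offsets predicted by the RPN, decode the RPN's predicted coordinates
 into absolute coordinates at the resolution of the training image (the image actually fed to the network)
 The returned proposals are the RPN's predicted coordinates for all anchor boxes, in [xmin,ymin,xmax,ymax] form
 shape   (batch_size,(H/16)*(W/16)*9, 4)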
 '''

 # 2. clip predicted boxes to image   (clip the proposals the RPN predicts for all anchor boxes)
 proposals = clip_boxes(proposals, im_info, batch_size)
 # proposals = clip_boxes_batch(proposals, im_info, batch_size)
 '''
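 For each image in the batch, the region proposals predicted by the RPN have been decoded to
 the input image resolution; this step clips the decoded proposal coordinates to lie within the input image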
 '''

In the predicted score tensor, channels 9~17 hold the probability that each of the 9 anchor boxes at the current feature map pixel is foreground.
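The decoding and clipping in step (2) can be sketched for a single box as below. The functions decode_box and clip_box are hypothetical stand-ins for what bbox_transform_inv and clip_boxes do per box; the +1 width convention and the exact formula are my assumption based on the usual Fast R-CNN style parameterization.

```python
import math

def decode_box(anchor, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to an [xmin, ymin, xmax, ymax] anchor."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w = x2 - x1 + 1.0
    h = y2 - y1 + 1.0
    cx = x1 + 0.5 * w
    cy = y1 + 0.5 * h
    pred_cx = dx * w + cx            # center shifts are relative to the anchor size
    pred_cy = dy * h + cy
    pred_w = math.exp(dw) * w        # size changes are predicted in log space
    pred_h = math.exp(dh) * h
    return [pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
            pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h]

def clip_box(box, im_height, im_width):
    """Clamp box coordinates to the image extent."""
    x1, y1, x2, y2 = box
    return [min(max(x1, 0.0), im_width - 1.0), min(max(y1, 0.0), im_height - 1.0),
            min(max(x2, 0.0), im_width - 1.0), min(max(y2, 0.0), im_height - 1.0)]

proposal = clip_box(decode_box([0.0, 0.0, 15.0, 15.0], [1.0, 0.0, 0.0, 0.0]), 600, 800)
```

The offsets are unbounded real numbers, which is consistent with the earlier remark that the regression head must not be followed by ReLU or sigmoid.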

(3) Based on the scores and bbox_deltas predicted by the RPN, select 2000 region proposals to feed into the downstream Fast R-CNN model for training

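Here I would like to explain the difference between the terms anchor, region proposal and bounding box. On the surface they are all rectangular boxes, but they are key terms in object detection and well worth distinguishing. The following is my own understanding: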

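Term	Meaning
anchor	Anchor box / prior box. Generating anchor boxes only needs (feature map resolution, anchor_scale, anchor_ratio); their coordinates are not predicted by a forward pass of the network at all. They are the raw prior boxes that dense detection models such as RPN and SSD (and one-stage methods such as YOLO) generate first, because the network's outputs are offsets relative to the anchor coordinates.
region proposal	Candidate box / ROI / candidate boxes. It usually appears in two-stage detectors and refers to a rectangle that is likely to be a foreground region, obtained either from hand-crafted prior knowledge (feeding the image to the selective search algorithm) or from what the network has learned (the RPN output, i.e. the original anchor boxes after position adjustment). These boxes are also called candidate boxes/ROIs because some prior knowledge or learned features give grounds to believe they are foreground. For the RPN output, each region proposal additionally carries a probability that it is foreground. In short, a region proposal is the candidate box at input image resolution obtained by decoding the RPN's predicted offsets against the predefined anchor boxes.
bounding boxes	Bounding box. It refers to the transformed bounding boxes that the network finally outputs, i.e. the final output of the whole network, which at test time are the boxes used to compute mAP. The term is also used for annotations, as in ground truth bounding boxes.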

The code below is taken from the forward() method in \faster-rcnn-pytorch-master\lib\model\rpn\proposal_layer.py

scores_keep = scores#(batch_size,(H/16)*(W/16)*9)
 proposals_keep = proposals
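 # first decode the offsets the RPN predicts for the anchor boxes into proposals, then clip the proposal
 # coordinates to the input image, giving proposals_keep
 _, order = torch.sort(scores_keep, 1, True)
 #for each input image in the batch, sort the (H/16)*(W/16)*9 region proposals by the RPN's foreground score in
 #descending order; the first return value is the sorted scores, the second is the corresponding indices
 #order   shape   [batch_size,(H/16)*(W/16)*9]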

 output = scores.new(batch_size, post_nms_topN, 5).zero_()
 '''
 output shape  [batch_size, post_nms_topN, 5]
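tensor.new()   method    creates a new tensor with the same datatype as the current torch.tensor
post_nms_topN=2000 is the number of candidate boxes (region proposals) finally passed to the Fast R-CNN model
This makes it clear that the RPN plays the role that the selective search algorithm plays in R-CNN and Fast R-CNN
The difference is that selective search requires no training, whereas the RPN produces region proposals by training a network
selective search likewise supplies candidate boxes to the downstream detector

As can be seen,
R-CNN and Fast R-CNN obtain their 2000 region proposals from the selective search algorithm,
so they do not count as dense detection
Faster R-CNN obtains its region proposals/ROIs by training the RPN
SSD/YOLO generate anchor boxes directly in a sliding-window fashion and are dense detection in the strict sense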
'''
 for i in range(batch_size):
     # # 3. remove predicted boxes with either height or width < threshold
     # # (NOTE: convert min_size to input image scale stored in im_info[2])
     proposals_single = proposals_keep[i]
     scores_single = scores_keep[i]

     # # 4. sort all (proposal, score) pairs by score from highest to lowest
     # # 5. take top pre_nms_topN (e.g. 6000)
     order_single = order[i]

     if pre_nms_topN > 0 and pre_nms_topN < scores_keep.numel():
         order_single = order_single[:pre_nms_topN]

     proposals_single = proposals_single[order_single, :]
     scores_single = scores_single[order_single].view(-1,1)

     # 6. apply nms (e.g. threshold = 0.7)
     # 7. take after_nms_topN (e.g. 300)
     # 8. return the top proposals (-> RoIs top)

     keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh, force_cpu=not cfg.USE_GPU_NMS)
     keep_idx_i = keep_idx_i.long().view(-1)

     if post_nms_topN > 0:
         keep_idx_i = keep_idx_i[:post_nms_topN]
     proposals_single = proposals_single[keep_idx_i, :]
     scores_single = scores_single[keep_idx_i, :]

     # padding 0 at the end.
     num_proposal = proposals_single.size(0)
     output[i,:,0] = i
     output[i,:num_proposal,1:] = proposals_single

 return output
 '''
 #shape   [batch_size,2000,5]
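Along the third dimension, column 0 is the index of the image in the batch that the region proposal belongs to
columns 1~4 are the region proposal's coordinates [xmin,ymin,xmax,ymax] at the resolution of the transformed input image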
 '''

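To summarize the forward pass up to this point:
step1 RPN network forward

First, the feature map shared by the RPN classification and regression heads is computed: the resnet101 layer3 output (output stride=16, output channels=1024) goes through a further 3*3 convolution to extract semantic features, giving the shared feature map (output stride=16, output channels=512). A 1*1 convolution with output channels=18 is then applied to the shared feature map to obtain the RPN's foreground/background class scores; after a softmax over the two channels of each anchor box this gives rpn_cls_prob with shape [batch_size,18,H/16,W/16]. A 1*1 convolution with output channels=36 gives the RPN's predicted position offsets for the 9 anchor boxes at every feature map pixel, rpn_bbox_pred with shape [batch_size,36,H/16,W/16].

step2 generate_anchor

generate_anchors.py of the RPN generates the 9 anchor boxes at position (0,0) of the feature map, shape [9,4]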

scales: self.anchor_scales = cfg.ANCHOR_SCALES#[8,16,32]

ratios: self.anchor_ratios = cfg.ANCHOR_RATIOS#[0.5,1,2]
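How the 9 base anchors are spread over the whole feature map can be sketched without any tensor library. The function tile_anchors is a hypothetical illustration of the shift-and-add idea, and the two base anchors used here are made-up values chosen only to keep the example small.

```python
def tile_anchors(base_anchors, feat_height, feat_width, feat_stride=16):
    """Add the shift of every feature-map location to every base anchor.

    Locations are visited row by row (x varies fastest), and the anchors of
    one location are stored contiguously, giving K * A boxes in total.
    """
    all_anchors = []
    for gy in range(feat_height):
        for gx in range(feat_width):
            sx, sy = gx * feat_stride, gy * feat_stride
            for x1, y1, x2, y2 in base_anchors:
                all_anchors.append([x1 + sx, y1 + sy, x2 + sx, y2 + sy])
    return all_anchors

# Two illustrative base anchors on a 2 x 3 feature map.
base = [[-8.0, -8.0, 23.0, 23.0], [0.0, 0.0, 15.0, 15.0]]
tiled = tile_anchors(base, feat_height=2, feat_width=3)
```

The tensor version in the post gets the same result in one broadcasted addition of a [1, A, 4] tensor and a [K, 1, 4] tensor, which avoids the three nested loops.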

step3 proposal_layer.py

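The proposal layer is a subclass of nn.Module and sits right after the RPN outputs
Its inputs include the dense set of candidate boxes (anchor boxes) laid over the image,
which can be viewed as dense candidate regions obtained with a sliding window strategy
They are in xmin ymin xmax ymax format at input image resolution

(1) Decode the RPN outputs against the anchors to get the RPN's predicted bounding boxes mapped onto the input image
For the output of the region proposal network, the prediction for each anchor box is decoded
into a transformed anchor box, i.e.:
Inputs:
    RPN_bounding boxes_prediction (the predictions are, of course, the normalized, encoded values)
    anchors (produced by the anchor generation step, in input image pixel coordinates)
Output:
    the RPN's region proposals, decoded and mapped to input image resolution; all (H/16)*(W/16)*9 anchors can simply be decoded by brute force
    the decoded region proposals live on the input image (coordinates are clipped to the image extent, e.g. 600*800 pixels, without changing the number of bounding boxes)
    decoded region proposal format  xmin ymin xmax ymax     [batch_size,(H/16)*(W/16)*9, 4]
(2) This step selects a subset of the (H/16)*(W/16)*9 region proposals output by the RPN (already at input image resolution) for training the downstream Fast R-CNN
    First sort by scores, the foreground probability the RPN predicts for each anchor box, and keep the top N
    hyperparameter  RPN_PRE_NMS_TOP_N  12000   is the number N of top transformed bounding boxes kept before NMS
    Then run NMS on these N transformed bounding boxes,
    hyperparameter    NMS_THRESH   0.7
    hyperparameter    RPN_POST_NMS_TOP_N = 2000
    Finally keep the top 2000 bounding boxes from the NMS output
    Return these top ROIs (image index in the batch plus box coordinates)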

The return value rois of the proposal layer is not used for training the RPN or computing its loss; it is used for the subsequent training of Fast R-CNN.
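The selection procedure summarized in step3 (sort by score, keep the top N, run NMS, keep the top N again) can be sketched for one image as follows. This is a simple greedy NMS written for illustration, not the repository's CPU/GPU nms kernel; select_proposals and iou are hypothetical names, and the +1 pixel convention in the area computation is an assumption.

```python
def iou(a, b):
    """IoU of two [xmin, ymin, xmax, ymax] boxes, using the +1 pixel convention."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)

def select_proposals(boxes, scores, pre_nms_top_n, post_nms_top_n, nms_thresh):
    """Return indices of the proposals kept for one image, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if pre_nms_top_n > 0:
        order = order[:pre_nms_top_n]
    keep = []
    for i in order:  # greedy NMS: drop a box that overlaps a better kept box too much
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    if post_nms_top_n > 0:
        keep = keep[:post_nms_top_n]
    return keep

boxes = [[0, 0, 99, 99], [5, 5, 104, 104], [200, 200, 299, 299], [0, 0, 9, 9]]
scores = [0.9, 0.8, 0.7, 0.1]
kept = select_proposals(boxes, scores, pre_nms_top_n=3, post_nms_top_n=2, nms_thresh=0.7)
```

In the real layer the kept boxes are then written into a zero-initialized [batch_size, post_nms_topN, 5] tensor, so an image with fewer surviving proposals is padded with zeros.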


3. rpn_data = self.RPN_anchor_target((rpn_cls_score.data, gt_boxes, im_info, num_boxes))

self.RPN_anchor_target = _AnchorTargetLayer(self.feat_stride, self.anchor_scales, self.anchor_ratios)

anchor target layer
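Anchor boxes (untouched, without applying any network predictions) are compared against the ground truth boxes
For each anchor box, the IOU with every ground truth box is computed, giving overlap
For each anchor box, take the largest overlap over all ground truth boxes
giving   max_overlaps   shape [#anchor_boxes]
For each ground truth box, take the largest overlap over all anchor boxes
giving   gt_max_overlaps  shape [#gt_boxes]

Selecting the positive samples:
    (1) for each ground truth box, the anchor box with the largest overlap with it is a positive sample
    (2) any anchor box whose IOU with some ground truth box is at least 0.7 is a positive sample    the code uses both (1) and (2)
After the positives are selected, the coordinates of all foreground positives are encoded (generate good bounding boxes regression coefficients)
In the actual implementation, every anchor box in the image is assigned a ground truth box, whether it ends up positive
or negative: each anchor box is matched to the gt box it overlaps most, and the regression target is encoded against that gt box

All anchors whose overlap is below 0.3 are marked as negative samples

The anchor target layer labels each anchor box as positive or negative and computes the encoded targets for all positives
It computes the overlap between each anchor box and each ground truth box and, for each anchor box,
finds the ground truth box with the largest IOU and keeps only that overlap value
If the overlap is at least 0.7, the anchor is a positive sample
If the overlap is less than 0.3, the anchor is a negative sample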
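The labelling rules above can be sketched on a small overlap matrix. The function label_anchors is a hypothetical single-image illustration; the order of the three assignments mirrors the order used in the code excerpt further down (negatives first, then rule (1), then rule (2)), and skipping gt columns whose best overlap is 0 is my way of ignoring zero-padded gt boxes.

```python
def label_anchors(overlaps, pos_thresh=0.7, neg_thresh=0.3):
    """overlaps[a][g] is the IoU of anchor a with gt box g.

    Returns one label per anchor: 1 positive, 0 negative, -1 ignored.
    """
    num_anchors, num_gt = len(overlaps), len(overlaps[0])
    labels = [-1] * num_anchors
    max_overlaps = [max(row) for row in overlaps]

    for a in range(num_anchors):          # low overlap with every gt box: negative
        if max_overlaps[a] < neg_thresh:
            labels[a] = 0

    for g in range(num_gt):               # rule (1): best anchor for each gt box
        best = max(overlaps[a][g] for a in range(num_anchors))
        if best > 0:                      # padded (all-zero) gt boxes match nothing
            for a in range(num_anchors):
                if overlaps[a][g] == best:
                    labels[a] = 1

    for a in range(num_anchors):          # rule (2): overlap at or above the threshold
        if max_overlaps[a] >= pos_thresh:
            labels[a] = 1
    return labels

labels = label_anchors([[0.8, 0.1], [0.2, 0.25], [0.5, 0.1], [0.0, 0.05]])
```

The second anchor shows why rule (1) matters: its best overlap is only 0.25, yet it becomes positive because no other anchor covers the second gt box better.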

anchor target layer

The following code is taken from the forward method in faster-rcnn-pytorch-master\lib\model\rpn\anchor_target_layer.py

def forward(self, input):
        '''

        :param input:
              rpn_cls_score.data, shape  [batch_size,18,H/16,W/16]   output of the 1*1 conv with output_channels=18
              gt_boxes,  shape [batch_size,20,5]
              im_info,  shape  [batch_size,3]
              num_boxes  shape  [batch_size]
        :return:
        '''
        # Algorithm:
        #
        # for each (H, W) location i
        #   generate 9 anchor boxes centered on cell i
        #   apply predicted bbox deltas at cell i to each of the 9 anchors
        # filter out-of-image anchors

        rpn_cls_score = input[0]
        gt_boxes = input[1]
        im_info = input[2]
        num_boxes = input[3]

        # map of shape (..., H, W)
        height, width = rpn_cls_score.size(2), rpn_cls_score.size(3)

        batch_size = gt_boxes.size(0)

        feat_height, feat_width = rpn_cls_score.size(2), rpn_cls_score.size(3)
        shift_x = np.arange(0, feat_width) * self._feat_stride
        shift_y = np.arange(0, feat_height) * self._feat_stride
        shift_x, shift_y = np.meshgrid(shift_x, shift_y)
        shifts = torch.from_numpy(np.vstack((shift_x.ravel(), shift_y.ravel(),
                                  shift_x.ravel(), shift_y.ravel())).transpose())
        shifts = shifts.contiguous().type_as(rpn_cls_score).float()
        #compute the coordinates of all anchor boxes  [xmin ymin xmax ymax]
        #the base anchors must be generated first, using   generate_anchors.py

        A = self._num_anchors
        K = shifts.size(0)

        self._anchors = self._anchors.type_as(gt_boxes) # move to specific gpu.
        all_anchors = self._anchors.view(1, A, 4) + shifts.view(K, 1, 4)
        all_anchors = all_anchors.view(K * A, 4)#shape   [(H/16)*(W/16)*9,4]

        total_anchors = int(K * A)#total number of anchors on a feature map of this resolution

        keep = ((all_anchors[:, 0] >= -self._allowed_border) &
                (all_anchors[:, 1] >= -self._allowed_border) &
                (all_anchors[:, 2] < long(im_info[0][1]) + self._allowed_border) &
                (all_anchors[:, 3] < long(im_info[0][0]) + self._allowed_border))

        inds_inside = torch.nonzero(keep).view(-1)

        # keep only inside anchors   (drop every anchor that crosses the image boundary)
        anchors = all_anchors[inds_inside, :]

        # label: 1 is positive, 0 is negative, -1 is dont care
        labels = gt_boxes.new(batch_size, inds_inside.size(0)).fill_(-1)
        '''
        labels   shape  (batch_size, inds_inside.size(0))
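        Although the ground truth bounding boxes differ between the training images in a batch,
        all images in a batch share exactly the same spatial resolution, so the total number of anchor boxes and
        the number and positions of the anchor boxes inside the image boundary are the same for every image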
        '''
        bbox_inside_weights = gt_boxes.new(batch_size, inds_inside.size(0)).zero_()
        bbox_outside_weights = gt_boxes.new(batch_size, inds_inside.size(0)).zero_()

        overlaps = bbox_overlaps_batch(anchors, gt_boxes)
        '''
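        For now this can be read simply as
        anchors    shape   [num_anchors,4]    num_anchors=inds_inside.size(0), the anchors inside the image
        gt_boxes   shape   [batch_size,20,5]
        overlaps   shape   [batch_size,num_anchors, num_max_gt]   num_max_gt=20 or 100
        the overlap between every anchor box and every ground truth box, for each training image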
        '''

        max_overlaps, argmax_overlaps = torch.max(overlaps, 2)
        '''
        max_overlaps     shape  [batch_size,num_anchors]    largest IOU between each anchor box and all gt boxes, range [0,1]
        argmax_overlaps  shape  [batch_size,num_anchors]     index of the gt box with the largest IOU for each anchor box, range [0,num_max_gt-1]
        For each image in the batch and each anchor box, all gt_boxes are scanned
        to find the gt box with the largest overlap; that overlap is taken as
        the anchor box's IOU with the ground truth
       '''

        gt_max_overlaps, _ = torch.max(overlaps, 1)
        '''
        gt_max_overlaps  shape  [batch_size,num_max_gt]
        For each image in the batch and each gt box, all anchor boxes are scanned
        to find the anchor box with the largest IOU with this gt box; that overlap is recorded
        '''

        if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:#RPN_CLOBBER_POSITIVES=False
            labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0
            '''
            # IOU < thresh: negative example    
            __C.TRAIN.RPN_NEGATIVE_OVERLAP = 0.3
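            For each anchor box of each training image in the batch,
            compute its overlap with all gt_boxes of that image
            If even the largest overlap is below the RPN negative IOU threshold,
            the anchor box is a negative sample
            Here labels has  shape [batch_size,num_keep_anchors]
            and max_overlaps has shape [batch_size,num_anchors], where num_anchors equals num_keep_anchors
            Element-wise torch.tensor operations avoid an explicit for loop
            Anchor boxes with IOU below 0.3 are set to negative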
            '''

        gt_max_overlaps[gt_max_overlaps==0] = 1e-5
        keep = torch.sum(overlaps.eq(gt_max_overlaps.view(batch_size,1,-1).expand_as(overlaps)), 2)

        if torch.sum(keep) > 0:
            labels[keep>0] = 1#mark the anchors with the largest overlap with a ground truth box as positive

        # fg label: above threshold IOU
        labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1#RPN_POSITIVE_OVERLAP=0.7

        if cfg.TRAIN.RPN_CLOBBER_POSITIVES:#False
            labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

        num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)
        '''
        __C.TRAIN.RPN_FG_FRACTION = 0.5   fraction of foreground samples
        # Total number of examples
        __C.TRAIN.RPN_BATCHSIZE = 256  
        
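        The batch size for training the RPN is 256
        i.e. the RPN loss is computed over 256 anchor boxes in total, foreground/background (positive/negative) combined
        This batch size does not appear in any forward-pass operation; it only says how many samples the RPN uses for the loss
        
        In other words, image_batch_size input images are forwarded to the RPN,
        every image in the batch gets the same number of anchor boxes at the same coordinates,
        and 256 samples are selected per image for the loss
        The selection is not made jointly over the anchor boxes of the whole batch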
        '''

        sum_fg = torch.sum((labels == 1).int(), 1)
        #labels  shape [batch_size,num_keep_anchors]
        #sum_fg  shape [batch_size,] number of positive samples in each image of the batch
        sum_bg = torch.sum((labels == 0).int(), 1)

        for i in range(batch_size):
            # subsample positive labels if we have too many
            '''
           For each image in the batch, if there are more than 128 positives, subsample the positive anchors of that image
           '''
            if sum_fg[i] > num_fg:
                fg_inds = torch.nonzero(labels[i] == 1).view(-1)
                #shape  [num_positive_anchors,]  number of positive anchors in the current image
                #labels[i] == 1  is 1 for foreground (positive) anchors and 0 otherwise
                #fg_inds   indices of the positive samples among all kept anchor boxes

                # torch.randperm seems has a bug on multi-gpu setting that cause the segfault.
                # See https://github.com/pytorch/pytorch/issues/1868 for more details.
                # use numpy instead.
                #rand_num = torch.randperm(fg_inds.size(0)).type_as(gt_boxes).long()
                rand_num = torch.from_numpy(np.random.permutation(fg_inds.size(0))).type_as(gt_boxes).long()
                '''
                numpy.random.permutation(x) 
               If x is an integer, randomly permute np.arange(x), i.e. a random shuffle
               permutation = list(np.random.permutation(10))
                [5, 1, 7, 6, 8, 9, 4, 0, 2, 3]  
                rand_num   a random permutation of the indices of the num_positive_anchors positives in the current training image
               '''
                '''
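                Anchors with overlap below 0.3 are negatives and those with overlap of at least 0.7 are positives
                Then, for each training image in the batch, at most 128 positives are randomly kept and negatives fill the rest of the 256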
                '''
                disable_inds = fg_inds[rand_num[:fg_inds.size(0)-num_fg]]
               #disable the first (number of positives - 128) shuffled indices so they no longer count as positives
                labels[i][disable_inds] = -1

            '''at this point the number of positives is guaranteed to be at most 128'''

#           num_bg = cfg.TRAIN.RPN_BATCHSIZE - sum_fg[i]
            num_bg = cfg.TRAIN.RPN_BATCHSIZE - torch.sum((labels == 1).int(), 1)[i]
            #the number of negatives to keep may be greater than or equal to 128

            # subsample negative labels if we have too many
            if sum_bg[i] > num_bg:
                bg_inds = torch.nonzero(labels[i] == 0).view(-1)
                #rand_num = torch.randperm(bg_inds.size(0)).type_as(gt_boxes).long()

                rand_num = torch.from_numpy(np.random.permutation(bg_inds.size(0))).type_as(gt_boxes).long()
                disable_inds = bg_inds[rand_num[:bg_inds.size(0)-num_bg]]
                labels[i][disable_inds] = -1

        offset = torch.arange(0, batch_size)*gt_boxes.size(1)#(0,20,40,60)

        argmax_overlaps = argmax_overlaps + offset.view(batch_size, 1).type_as(argmax_overlaps)
        '''
        argmax_overlaps  shape  [batch_size,num_anchors]     
        index of the gt box with the largest IOU for each anchor box, range [0,num_max_gt-1]
        
        argmax_overlaps  [batch_size,num_anchors]     
        '''
        bbox_targets = _compute_targets_batch(anchors, gt_boxes.view(-1,5)[argmax_overlaps.view(-1), :].view(batch_size, -1, 5))
        '''
        anchors shape [num_keep_anchors,4]
        gt_boxes  shape [batch_size*num_max_gt,5]
        gt_boxes.view(-1,5)[argmax_overlaps.view(-1), :]         
        second argument     shape   (batch_size, num_keep_anchors, 5)
        
        Meaning: using the largest overlap computed earlier between each anchor box and the ground truth bounding boxes of its image,
        every anchor box is assigned one ground truth box (at this point anchors are not yet split into positives and negatives; for every
        anchor box lying inside the image, the gt box it overlaps most is taken as
        its ground truth box. The gt here is still in coordinates of the training image and has not been encoded)
        
        bbox_targets   encodes all anchor boxes of each image in the batch (each anchor box's gt box is
        the one among all gt_boxes with which it has the largest overlap)
        The anchor offset encoding in Faster R-CNN is the same as in Fast R-CNN
        
        bbox_targets    shape  (batch_size, num_keep_anchors, 4)    the encoded gt position for each anchor
        in [targets_dx,targets_dy,targets_dw,targets_dh] format
        '''

        # use a single value instead of 4 values for easy index.
        bbox_inside_weights[labels==1] = cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS[0]
        '''
        bbox_inside_weights   shape [batch_size, inds_inside.size(0)]   inds_inside.size(0)=num_keep_anchors
        __C.TRAIN.RPN_BBOX_INSIDE_WEIGHTS = (1.0, 1.0, 1.0, 1.0)
        # Give the positive RPN examples weight of p * 1 / {num positives}
        # and give negatives a weight of (1 - p)   
        
        At this point labels already holds the positive and negative anchors chosen for each training image in the batch (256 anchors per image,
        with at most 128 positives and at least 128 negatives)
        labels    shape   [batch_size,num_keep_anchors]
        '''

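        if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:#-1
            # note: i is the index left over from the loop above, so the count comes from the last image in the batch
            num_examples = torch.sum(labels[i] >= 0)#256
            positive_weights = 1.0 / num_examples.item()#1/256
            negative_weights = 1.0 / num_examples.item()
        else:
            # note: as excerpted, this branch only validates the setting and does not define the two weights
            assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
                    (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))

        bbox_outside_weights[labels == 1] = positive_weights#loss weight of the positive samples     1/256
        bbox_outside_weights[labels == 0] = negative_weights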

        labels = _unmap(labels, total_anchors, inds_inside, batch_size, fill=-1)
        bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, batch_size, fill=0)
        bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, batch_size, fill=0)
        bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, batch_size, fill=0)
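        '''The operations above worked only on the anchor boxes inside the input image; now the
        num_keep_anchors entries are mapped back into the full set of (H/16)*(W/16)*9 anchor boxes'''
        '''
        bbox_inside_weights   1 for all positive anchor boxes, 0 elsewhere
        bbox_outside_weights   1/256 for all positive and negative anchor boxes, 0 elsewhere
        '''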

        outputs = []

        labels = labels.view(batch_size, height, width, A).permute(0,3,1,2).contiguous()
        '''
        labels.shape   [batch_size,(H/16)*(W/16)*9]
        labels.shape   [batch_size,9,(H/16),(W/16)]
        '''
        labels = labels.view(batch_size, 1, A * height, width)
        outputs.append(labels)#labels.shape   [batch_size,1,9*(H/16),(W/16)]

        bbox_targets = bbox_targets.view(batch_size, height, width, A*4).permute(0,3,1,2).contiguous()
        outputs.append(bbox_targets)#bbox_targets  shape  [batch_size,36,(H/16),(W/16)]

        anchors_count = bbox_inside_weights.size(1)
        bbox_inside_weights = bbox_inside_weights.view(batch_size,anchors_count,1).expand(batch_size, anchors_count, 4)

        bbox_inside_weights = bbox_inside_weights.contiguous().view(batch_size, height, width, 4*A)\
                            .permute(0,3,1,2).contiguous()
        #bbox_inside_weights   shape  [batch_size,36, height, width]

        outputs.append(bbox_inside_weights)

        bbox_outside_weights = bbox_outside_weights.view(batch_size,anchors_count,1).expand(batch_size, anchors_count, 4)
        bbox_outside_weights = bbox_outside_weights.contiguous().view(batch_size, height, width, 4*A)\
                            .permute(0,3,1,2).contiguous()
        outputs.append(bbox_outside_weights)
        #bbox_outside_weights  shape   [batch_size,36, height, width]

        return outputs

outputs

'''
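#labels.shape [batch_size,1,9*(H/16),(W/16)]  1 for positives, 0 for negatives, -1 for anchors excluded from the loss; 256 positive and negative samples in total per image in the batch
#bbox_targets  shape  [batch_size,36,(H/16),(W/16)]
the gt position offsets for each anchor box (the gt encoded against the anchor, i.e. the values the RPN is expected to output)
#bbox_inside_weights   shape  [batch_size,36, height, width] 1 for all positive anchor boxes, 0 elsewhere
#bbox_outside_weights  shape   [batch_size,36, height, width] 1/256 for all positive and negative anchor boxes, 0 elsewhere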
'''
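The per-image subsampling that produces these labels can be sketched with the standard library. subsample_labels is a hypothetical helper that mirrors the logic of the loop in the forward method (disable surplus positives, then disable surplus negatives); the seeded random.Random is only there to make the sketch reproducible.

```python
import random

def subsample_labels(labels, batch_size=256, fg_fraction=0.5, rng=None):
    """Keep at most fg_fraction * batch_size positives and fill the rest with negatives.

    Surplus samples are set to -1 so that they are ignored by the loss.
    """
    rng = rng or random.Random(0)
    labels = list(labels)
    num_fg = int(fg_fraction * batch_size)

    fg_inds = [i for i, lab in enumerate(labels) if lab == 1]
    if len(fg_inds) > num_fg:
        for i in rng.sample(fg_inds, len(fg_inds) - num_fg):
            labels[i] = -1

    # Negatives fill whatever remains of the batch after the positives are fixed.
    num_bg = batch_size - sum(1 for lab in labels if lab == 1)
    bg_inds = [i for i, lab in enumerate(labels) if lab == 0]
    if len(bg_inds) > num_bg:
        for i in rng.sample(bg_inds, len(bg_inds) - num_bg):
            labels[i] = -1
    return labels

sampled = subsample_labels([1] * 200 + [0] * 1000 + [-1] * 50)
```

When an image has only a handful of positives, the negatives take up the slack, so the total stays at 256 as long as enough negatives exist.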

4. RPN loss computation: return rois, self.rpn_loss_cls, self.rpn_loss_box

rois are the 2000 region proposals produced for each image in the batch (by the proposal layer)

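[batch_size,2000,5] Along the third dimension, column 0 is the index of the image in the batch that the region proposal belongs to, and columns 1~4 are its coordinates [xmin,ymin,xmax,ymax] at the resolution of the transformed input image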

rpn_loss_cls is the RPN classification loss. For a batch of batch size images, the classification loss is computed with cross entropy over batch size*256 (the RPN batch size) positive and negative anchors

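PyTorch's F.cross_entropy combines the softmax and nll_loss operations,
so rpn_cls_score only needs the convolution that predicts the scores and no separate softmax
A batch contributes batch_size*256 anchor boxes to the classification loss (256 per image, positives and negatives combined, with a foreground fraction of at most 0.5)
so the classification loss is averaged over batch_size*256 samples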

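rpn_loss_box is the RPN box regression loss. For a batch of batch size images, the smooth L1 regression loss is computed over all positives together (at most 128 per image), multiplied by the weight 1/256 and then averaged.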
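How the inside and outside weights enter the regression loss can be sketched for a single image. This is an illustration under assumptions: smooth_l1 is a hypothetical helper operating on flat lists, and the sigma parameter that sets the switch point between the quadratic and linear parts is my assumption about the usual formulation, not a value taken from this post.

```python
def smooth_l1(pred, target, inside_w, outside_w, sigma=3.0):
    """Weighted smooth L1 loss summed over all coordinates of one image.

    inside_w selects which coordinates count at all (1 for positives, 0 otherwise);
    outside_w scales each term (for example 1/256 per sampled anchor).
    """
    sigma2 = sigma * sigma
    total = 0.0
    for p, t, wi, wo in zip(pred, target, inside_w, outside_w):
        diff = wi * (p - t)
        abs_diff = abs(diff)
        if abs_diff < 1.0 / sigma2:
            term = 0.5 * sigma2 * diff * diff      # quadratic near zero
        else:
            term = abs_diff - 0.5 / sigma2         # linear for large errors
        total += wo * term
    return total

# One positive anchor coordinate and one ignored coordinate, weight 1/256 each.
loss = smooth_l1([2.0, 5.0], [0.0, 0.0], [1.0, 0.0], [1.0 / 256, 1.0 / 256], sigma=1.0)
```

Because the inside weight is 0 for everything except positives, negatives and ignored anchors contribute nothing to the regression loss even though a target is computed for every anchor.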

III. roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes)

faster-rcnn-pytorch-master\lib\model\rpn\proposal_target_layer_cascade.py

class _ProposalTargetLayer(nn.Module):
    """
    Assign object detection proposals to ground-truth targets. Produces proposal
    classification labels and bounding-box regression targets.
    """

    def __init__(self, nclasses):
        super(_ProposalTargetLayer, self).__init__()
        self._num_classes = nclasses
        self.BBOX_NORMALIZE_MEANS = torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS)# (0.0, 0.0, 0.0, 0.0)
        self.BBOX_NORMALIZE_STDS = torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS)#(0.1, 0.1, 0.2, 0.2)
        self.BBOX_INSIDE_WEIGHTS = torch.FloatTensor(cfg.TRAIN.BBOX_INSIDE_WEIGHTS)#(1.0, 1.0, 1.0, 1.0)

    def forward(self, all_rois, gt_boxes, num_boxes):
        '''
        :param all_rois:
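              the output of the proposal layer; treating the RPN as a replacement for selective search, it produces 2000 region proposals
              They are produced as follows: the (H/16)*(W/16)*9 position offsets predicted by the RPN are decoded against the
              corresponding anchor boxes into coordinates at the input image resolution (the rescaled input image)
              Then the top 12000 region proposals are selected by the RPN's predicted foreground score for the anchor boxes,
              NMS with a threshold of 0.7 is applied, and the top 2000 of the region proposals that survive NMS are used as the
              input for training the Fast R-CNN model    [batch_size,2000,5]
        :param gt_boxes: torch.tensor [batch_size,20,5] coordinates read from the annotation.txt file, after rescaling
        :param num_boxes:torch.tensor  [batch_size,]    number of gt boxes in each training image of the batch
        all_rois is the output of the RPN's proposal layer,
        gt_boxes and num_boxes are inputs to the whole Faster R-CNN model (read from the dataloader in trainval_net.py)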
        :return:
        '''

        self.BBOX_NORMALIZE_MEANS = self.BBOX_NORMALIZE_MEANS.type_as(gt_boxes)#(0.0, 0.0, 0.0, 0.0)
        self.BBOX_NORMALIZE_STDS = self.BBOX_NORMALIZE_STDS.type_as(gt_boxes)#(0.1, 0.1, 0.2, 0.2)
        self.BBOX_INSIDE_WEIGHTS = self.BBOX_INSIDE_WEIGHTS.type_as(gt_boxes)#(1.0, 1.0, 1.0, 1.0)

        gt_boxes_append = gt_boxes.new(gt_boxes.size()).zero_()
        gt_boxes_append[:,:,1:5] = gt_boxes[:,:,:4]

        # Include ground-truth boxes in the set of candidate rois
        all_rois = torch.cat([all_rois, gt_boxes_append], 1)
        '''
        torch.cat
        before   all_rois  shape [batch_size,2000,5]    gt_boxes_append   shape   [batch_size,20,5]
        after    all_rois  shape [batch_size,2020,5]    2020=num_region_proposal+num_max_gt
        '''

        num_images = 1
        rois_per_image = int(cfg.TRAIN.BATCH_SIZE / num_images)#128
        '''
        # Minibatch size (number of regions of interest [ROIs])
        __C.TRAIN.BATCH_SIZE = 128  
        # Fraction of minibatch that is labeled foreground (i.e. class > 0)
        __C.TRAIN.FG_FRACTION = 0.25   
        '''
        fg_rois_per_image = int(np.round(cfg.TRAIN.FG_FRACTION * rois_per_image))#0.25*128=32
        fg_rois_per_image = 1 if fg_rois_per_image == 0 else fg_rois_per_image#32
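        '''
        For each training image in the batch, 2000 region proposals are handed to the Fast R-CNN model,
        but Fast R-CNN trains on only 128 RoIs per image: at most 32 positives and at least 96 negatives.
        All region proposals of an image are first split into positives and negatives using rois
        and gt_boxes. At most 32 positives are then drawn at random per image (if there are more than
        32 positive proposals, 32 are picked at random; otherwise all positives are used), and
        (128 - number of positives picked for this image) negatives are drawn at random from the
        negative proposals. "Positive" and "negative" here refer to the region proposals used to train
        Fast R-CNN, and the split for each image depends on that image's
        ground truth bounding boxes.

        When training the RPN, positions are refined starting from the anchor boxes, and the network
        predicts offsets relative to the anchor boxes. The anchors are split into positives and
        negatives using the image's gt_boxes, and the RPN classification and regression losses are computed.
        When training Fast R-CNN, positions are refined starting from the 2000 region proposals output
        by the RPN, and the predicted offsets are relative to those proposals. The proposals are split
        into positives and negatives using the image's gt_boxes,
        and the Fast R-CNN classification and regression losses are computed.
        '''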

        labels, rois, bbox_targets, bbox_inside_weights = self._sample_rois_pytorch(
            all_rois, gt_boxes, fg_rois_per_image,
            rois_per_image, self._num_classes)

        bbox_outside_weights = (bbox_inside_weights > 0).float()

        '''
       rois (4, 128, 5), 
       labels (4, 128),
       bbox_targets  (4, 128, 4),
       bbox_inside_weights   (4, 128, 4), 
       bbox_outside_weights   (4, 128, 4)   
        '''

        # print(rois.shape, labels.shape, bbox_targets.shape, bbox_inside_weights.shape, bbox_outside_weights.shape)

        return rois, labels, bbox_targets, bbox_inside_weights, bbox_outside_weights

    def backward(self, top, propagate_down, bottom):
        """This layer does not propagate gradients."""
        pass

    def reshape(self, bottom, top):
        """Reshaping happens during the call to forward."""
        pass

    def _get_bbox_regression_labels_pytorch(self, bbox_target_data, labels_batch, num_classes):
        """Bounding-box regression targets (bbox_target_data) are stored in a
        compact form b x N x (class, tx, ty, tw, th)

        This function expands those targets into the 4-of-4*K representation used
        by the network (i.e. only one class has non-zero targets).

        Returns:
            bbox_target (tensor): b x N x 4 blob of regression targets (this version allocates 4 columns per RoI, not 4K)
            bbox_inside_weights (tensor): b x N x 4 blob of loss weights
        """
        batch_size = labels_batch.size(0)
        rois_per_image = labels_batch.size(1)
        clss = labels_batch
        bbox_targets = bbox_target_data.new(batch_size, rois_per_image, 4).zero_()
        bbox_inside_weights = bbox_target_data.new(bbox_targets.size()).zero_()

        for b in range(batch_size):
            # assert clss[b].sum() > 0
            if clss[b].sum() == 0:
                continue
            inds = torch.nonzero(clss[b] > 0).view(-1)
            for i in range(inds.numel()):
                ind = inds[i]
                bbox_targets[b, ind, :] = bbox_target_data[b, ind, :]
                bbox_inside_weights[b, ind, :] = self.BBOX_INSIDE_WEIGHTS

        return bbox_targets, bbox_inside_weights


    def _compute_targets_pytorch(self, ex_rois, gt_rois):
        """Compute bounding-box regression targets for an image."""

        assert ex_rois.size(1) == gt_rois.size(1)
        assert ex_rois.size(2) == 4
        assert gt_rois.size(2) == 4

        batch_size = ex_rois.size(0)
        rois_per_image = ex_rois.size(1)

        targets = bbox_transform_batch(ex_rois, gt_rois)

        if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:
            # Optionally normalize targets by a precomputed mean and stdev
            targets = ((targets - self.BBOX_NORMALIZE_MEANS.expand_as(targets))
                        / self.BBOX_NORMALIZE_STDS.expand_as(targets))

        return targets


    def _sample_rois_pytorch(self, all_rois, gt_boxes, fg_rois_per_image, rois_per_image, num_classes):
        '''
        :param all_rois: shape [batch_size,2020,5]    2020=num_region_proposal+num_max_gt
        :param gt_boxes: torch.tensor [batch_size,20,5] coordinates read from annotation.txt, after rescaling
        :param fg_rois_per_image: 128*0.25=32
        :param rois_per_image: 128
        :param num_classes:
        :return:
        '''
        """Generate a random sample of RoIs comprising foreground and background
        examples.
        """

        # overlaps: (rois x gt_boxes)

        overlaps = bbox_overlaps_batch(all_rois, gt_boxes)
        # Overlap between the region proposals and the gt_boxes of every training image in the batch
        # overlaps  shape   [batch_size,2020,20]    2020=num_region_proposal+num_max_gt

        max_overlaps, gt_assignment = torch.max(overlaps, 2)
        '''For each image in the batch and each region proposal from the RPN, scan all gt_boxes
        and find the one with the largest overlap. That largest value is taken as the proposal's
        overlap, and the proposal takes its ground truth class from the same gt box.'''

        batch_size = overlaps.size(0)
        num_proposal = overlaps.size(1)
        num_boxes_per_img = overlaps.size(2)

        offset = torch.arange(0, batch_size)*gt_boxes.size(1)
        offset = offset.view(-1, 1).type_as(gt_assignment) + gt_assignment

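        # Plain indexing is used in place of the legacy Tensor.index() call so the line also runs on newer PyTorch versions.
        labels = gt_boxes[:,:,4].contiguous().view(-1)[(offset.view(-1),)].view(batch_size, -1)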
        
        labels_batch = labels.new(batch_size, rois_per_image).zero_()
        rois_batch  = all_rois.new(batch_size, rois_per_image, 5).zero_()
        gt_rois_batch = all_rois.new(batch_size, rois_per_image, 5).zero_()
        # Guard against the case when an image has fewer than max_fg_rois_per_image
        # foreground RoIs
        for i in range(batch_size):

            fg_inds = torch.nonzero(max_overlaps[i] >= cfg.TRAIN.FG_THRESH).view(-1)
            fg_num_rois = fg_inds.numel()

            # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
            bg_inds = torch.nonzero((max_overlaps[i] < cfg.TRAIN.BG_THRESH_HI) &
                                    (max_overlaps[i] >= cfg.TRAIN.BG_THRESH_LO)).view(-1)
            bg_num_rois = bg_inds.numel()

            if fg_num_rois > 0 and bg_num_rois > 0:
                # sampling fg
                fg_rois_per_this_image = min(fg_rois_per_image, fg_num_rois)
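                '''
                For each image in the batch, pick 128*0.25=32 positives from all positive region
                proposals (if there are fewer than 32 positives, train on all positive RoIs).
                SSD also uses a 1:3 ratio of positives to negatives.
                A post I read yesterday argued that how well Faster R-CNN performs depends less on
                the training of the RPN and more on the training of Fast R-CNN.
                '''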

                # torch.randperm seems has a bug on multi-gpu setting that cause the segfault.
                # See https://github.com/pytorch/pytorch/issues/1868 for more details.
                # use numpy instead.
                #rand_num = torch.randperm(fg_num_rois).long().cuda()
                rand_num = torch.from_numpy(np.random.permutation(fg_num_rois)).type_as(gt_boxes).long()
                fg_inds = fg_inds[rand_num[:fg_rois_per_this_image]]

                # sampling bg
                bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image

                # Seems torch.rand has a bug, it will generate very large number and make an error.
                # We use numpy rand instead.
                #rand_num = (torch.rand(bg_rois_per_this_image) * bg_num_rois).long().cuda()
                rand_num = np.floor(np.random.rand(bg_rois_per_this_image) * bg_num_rois)
                rand_num = torch.from_numpy(rand_num).type_as(gt_boxes).long()
                bg_inds = bg_inds[rand_num]

            elif fg_num_rois > 0 and bg_num_rois == 0:
                # sampling fg
                #rand_num = torch.floor(torch.rand(rois_per_image) * fg_num_rois).long().cuda()
                rand_num = np.floor(np.random.rand(rois_per_image) * fg_num_rois)
                rand_num = torch.from_numpy(rand_num).type_as(gt_boxes).long()
                fg_inds = fg_inds[rand_num]
                fg_rois_per_this_image = rois_per_image
                bg_rois_per_this_image = 0
            elif bg_num_rois > 0 and fg_num_rois == 0:
                # sampling bg
                #rand_num = torch.floor(torch.rand(rois_per_image) * bg_num_rois).long().cuda()
                rand_num = np.floor(np.random.rand(rois_per_image) * bg_num_rois)
                rand_num = torch.from_numpy(rand_num).type_as(gt_boxes).long()

                bg_inds = bg_inds[rand_num]
                bg_rois_per_this_image = rois_per_image
                fg_rois_per_this_image = 0
            else:
                raise ValueError("bg_num_rois = 0 and fg_num_rois = 0, this should not happen!")

            # The indices that we're selecting (both fg and bg)
            keep_inds = torch.cat([fg_inds, bg_inds], 0)

            # Select sampled values from various arrays:
            labels_batch[i].copy_(labels[i][keep_inds])

            # Clamp labels for the background RoIs to 0
            if fg_rois_per_this_image < rois_per_image:
                labels_batch[i][fg_rois_per_this_image:] = 0

            rois_batch[i] = all_rois[i][keep_inds]
            rois_batch[i,:,0] = i

            gt_rois_batch[i] = gt_boxes[i][gt_assignment[i][keep_inds]]

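        '''
        (1)rois_batch.shape,(batch_size, 128, 5),
        The rois used to train Fast R-CNN: 128 region proposals sampled at random per training image (positives to negatives 1:3).
        Of the 5 columns, the first is the index of the image within the batch that the region proposal belongs to,
        and the last 4 are the coordinates of the selected region proposal at the spatial resolution of the input image.
        These 128 rois, positives and negatives alike, are what Fast R-CNN trains on. Fast R-CNN treats the
        2000 region proposals passed on by the RPN as good candidate boxes (they already carry some information).
        (2)labels_batch.shape,(batch_size, 128),
        Ground truth class of each of the 128 region proposals per training image,
        range   (0,num_classes-1)
        (3)gt_rois_batch.shape,(batch_size, 128, 5)
        Ground truth coordinates of each of the 128 region proposals per training image (before encoding).
        The first four columns are the matched ground truth box [xmin,ymin,xmax,ymax], still at the spatial resolution
        of the rescaled input image. The last column is the roi's ground truth class label; 0 is background, and in my dataset the labels are 0-7.
        '''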

        bbox_target_data = self._compute_targets_pytorch(
                rois_batch[:,:,1:5], gt_rois_batch[:,:,:4])
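        '''
        bbox_target_data encodes rois_batch [batch_size,128,4], the region proposals at input-image
        resolution, against gt_rois_batch, also at input-image resolution. The encoding treats the
        current region proposal as the anchor and has nothing to do with the spatial resolution of the earlier RPN anchors.
        The resulting offsets are those of gt_boxes relative to the region proposals.

        These are the values the regression branch of Fast R-CNN is expected to output. I saw values as small as -3132, and very large ones too.
        My first guess was that the areas and aspect ratios of my anchor boxes were badly chosen; I had also
        considered clustering the aspect ratios and areas of all ground truth boxes in the training set.
        bbox_target_data  shape  [batch_size,128,4]

        That guess turned out to be wrong. Such extreme values appear because bbox_target_data
        encodes all 128 region proposals, positive or negative, and for negatives the target offsets
        are naturally very large.

        At this point bbox_target_data holds encoded positions for all batch*128 region proposals,
        positive or negative (a negative region proposal is matched to the gt box in the current
        image that overlaps it most, and is then encoded against that box).

        Note that the classifier and regression branches, in both the RPN and Fast R-CNN, consist only of a
        1*1 convolution (RPN) or an nn.Linear fully connected layer (Fast R-CNN), with no activation function afterwards.
        bbox_target_data, i.e. the offsets Fast R-CNN is expected to predict, contains many values like -12 or
        14, so ReLU or sigmoid cannot be applied to the regression output. For the classifier,
        the loss is computed with F.cross_entropy, whose implementation already includes softmax and nn.NLLLoss,
        so the classifier needs no activation function either.

        '''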

        bbox_targets, bbox_inside_weights = \
                self._get_bbox_regression_labels_pytorch(bbox_target_data, labels_batch, num_classes)
        '''
        (1)bbox_targets
        shape   [batch_size,128,4]   target offsets are non-zero for positives and 0 for negatives
        (2)bbox_inside_weights
        shape   [batch_size,128,4]   marks which of the 128 region proposals of each image in the
        batch are positives and which are negatives
        1 for positives, 0 for negatives
         [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],

        ……
        '''
        return labels_batch, rois_batch, bbox_targets, bbox_inside_weights
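The sampling logic in `_sample_rois_pytorch` can be condensed into a single-image sketch. This is a simplified illustration, not the repository code: the function name `sample_rois` and the threshold defaults (0.5, 0.5, 0.0) are assumptions, and the real values come from `cfg.TRAIN`. As in the code above, foreground RoIs are drawn without replacement and background RoIs with replacement.

```python
import torch

def sample_rois(max_overlaps, rois_per_image=128, fg_fraction=0.25,
                fg_thresh=0.5, bg_thresh_hi=0.5, bg_thresh_lo=0.0):
    """Return (fg_keep, bg_keep) index tensors for one image (simplified sketch)."""
    fg_inds = torch.nonzero(max_overlaps >= fg_thresh).view(-1)
    bg_inds = torch.nonzero((max_overlaps < bg_thresh_hi) &
                            (max_overlaps >= bg_thresh_lo)).view(-1)
    if fg_inds.numel() == 0 and bg_inds.numel() == 0:
        raise ValueError("no foreground and no background candidates")

    if bg_inds.numel() == 0:
        # No background at all: fill the whole quota with foreground, with replacement.
        pick = torch.randint(0, fg_inds.numel(), (rois_per_image,))
        return fg_inds[pick], bg_inds

    # At most fg_fraction * rois_per_image positives, drawn without replacement.
    num_fg = min(int(round(fg_fraction * rois_per_image)), fg_inds.numel())
    fg_keep = fg_inds[torch.randperm(fg_inds.numel())[:num_fg]]

    # The rest of the quota is background, drawn with replacement.
    num_bg = rois_per_image - num_fg
    bg_keep = bg_inds[torch.randint(0, bg_inds.numel(), (num_bg,))]
    return fg_keep, bg_keep

# Toy input: 10 proposals with IoU 0.8 (foreground), 500 with IoU 0.2 (background).
max_overlaps = torch.cat([torch.full((10,), 0.8), torch.full((500,), 0.2)])
fg_keep, bg_keep = sample_rois(max_overlaps)
```

With only 10 foreground candidates, all 10 are kept and the remaining 118 slots are filled with background, so each image still contributes 128 RoIs.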

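Outputs of proposal_target_layer

variable	shape	meaning
labels_batch	(batch_size, 128)	Ground truth class of each of the 128 region proposals sampled per training image for training Fast R-CNN; values range from 0 to num_classes-1
rois_batch	(batch_size, 128, 5)	The rois used to train Fast R-CNN: 128 region proposals sampled at random per training image (positives to negatives 1:3). Of the 5 columns, the first is the index of the image within the batch that the region proposal belongs to, and the last 4 are the proposal's coordinates at the spatial resolution of the input image
bbox_targets	(batch_size, 128, 4)	Target offsets are non-zero for positives and 0 for negatives; the encoded offsets are those of gt_boxes relative to the region proposals
bbox_inside_weights	(batch_size, 128, 4)	Marks which of the 128 region proposals of each image in the batch are positives and which are negatives: 1 for positives, 0 for negatives
bbox_outside_weights	(batch_size, 128, 4)	bbox_outside_weights = (bbox_inside_weights > 0).float(), so its values are the same as bbox_inside_weights (the inside weights here are 1.0)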
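The bbox_targets column stores gt boxes encoded relative to the sampled region proposals. Below is a minimal sketch of such an encoding followed by the mean/std normalization. It assumes the common (tx, ty, tw, th) parameterization with the "+1" width convention; the function name `encode_boxes` is hypothetical, and the default means and stds mirror the values shown in the code comments above.

```python
import torch

def encode_boxes(rois, gt, means=(0.0, 0.0, 0.0, 0.0), stds=(0.1, 0.1, 0.2, 0.2)):
    """Encode gt boxes relative to rois; both are [N, 4] as [xmin, ymin, xmax, ymax]."""
    w = rois[:, 2] - rois[:, 0] + 1.0
    h = rois[:, 3] - rois[:, 1] + 1.0
    cx = rois[:, 0] + 0.5 * w
    cy = rois[:, 1] + 0.5 * h

    gw = gt[:, 2] - gt[:, 0] + 1.0
    gh = gt[:, 3] - gt[:, 1] + 1.0
    gcx = gt[:, 0] + 0.5 * gw
    gcy = gt[:, 1] + 0.5 * gh

    # Center shifts are measured in units of the roi size; sizes use a log ratio.
    deltas = torch.stack([(gcx - cx) / w, (gcy - cy) / h,
                          torch.log(gw / w), torch.log(gh / h)], dim=1)
    # Normalization by precomputed mean and standard deviation.
    return (deltas - torch.tensor(means)) / torch.tensor(stds)

rois = torch.tensor([[0.0, 0.0, 9.0, 9.0], [0.0, 0.0, 9.0, 9.0]])
gt = torch.tensor([[0.0, 0.0, 9.0, 9.0], [5.0, 0.0, 14.0, 9.0]])
targets = encode_boxes(rois, gt)
```

The second pair shifts the box by half its width, which gives tx = 0.5 before normalization and 5.0 after dividing by 0.1. The normalization alone therefore enlarges x/y offsets tenfold, so targets well outside the range of ReLU or sigmoid outputs are expected.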

IV. ROI Pooling on the [batch_size,128,4] region proposals

1.grid_xy = _affine_grid_gen(rois.view(-1, 5), base_feat.size()[2:], self.grid_size)

2.pooled_feat = self.RCNN_roi_crop(base_feat, Variable(grid_yx).detach())

3.pooled_feat = F.max_pool2d(pooled_feat, 2, 2)

The output pooled_feat has shape [batch_size*128,1024,7,7].

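For each image in the batch, each of the 128 region proposals is first mapped from its coordinates in rois_batch (at input-image resolution) onto the feature map output by resnet.layer3 (output stride=16, output channels=1024). The corresponding feature patch is then cropped from the feature map (its resolution can differ from one region proposal to another) and brought to a fixed 7*7, with the channel count still 1024. Each image in the batch thus yields 128 feature patches of size 7*7.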
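The map, crop, and resize sequence can be sketched with plain tensor slicing and adaptive max pooling. Note that this swaps in a simple RoI max pooling for illustration; the three steps listed above use an affine grid and a crop layer instead. The function name `roi_pool_single` and the example boxes are hypothetical.

```python
import torch
import torch.nn.functional as F

def roi_pool_single(feat, roi, output_size=7, spatial_scale=1.0 / 16):
    """feat: [C, H, W] feature map; roi: [xmin, ymin, xmax, ymax] in input-image pixels."""
    _, H, W = feat.shape
    # Map image coordinates onto the stride-16 feature map.
    x1, y1, x2, y2 = [int(round(float(v) * spatial_scale)) for v in roi]
    # Clamp to the feature map and keep at least one cell in each direction.
    x1 = min(max(x1, 0), W - 1)
    y1 = min(max(y1, 0), H - 1)
    x2 = min(max(x2, x1), W - 1)
    y2 = min(max(y2, y1), H - 1)
    region = feat[:, y1:y2 + 1, x1:x2 + 1]      # size varies from roi to roi
    # Bring every region to the same fixed spatial size.
    return F.adaptive_max_pool2d(region.unsqueeze(0), output_size).squeeze(0)

feat = torch.randn(1024, 38, 50)                # stand-in for layer3 output of one image
rois = [[0.0, 0.0, 320.0, 240.0], [100.0, 50.0, 180.0, 400.0]]
pooled = torch.stack([roi_pool_single(feat, r) for r in rois])
```

Regardless of how large each region is on the feature map, every RoI ends up as a 1024-channel 7*7 patch, which is what lets the RoIs be stacked into one batch for the head.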

V. pooled_feat = self._head_to_tail(pooled_feat)

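The feature patches cropped from the feature map shared by the RPN and Fast R-CNN, pooled_feat with shape [batch_size*128,1024,7,7], go through resnet.layer4 for further feature extraction, giving shape [batch_size*128,2048,4,4] (a stride-2 stage maps a 7*7 input to 4*4, not 3*3). Global average pooling over this 4*4 patch gives shape [batch_size*128,2048], i.e. a 2048-dimensional feature vector for each of the 128 positive and negative region proposals of every image in the batch. This vector is fed to two fully connected layers: the classifier, 2048->num_classes (the number of classes in the dataset plus 1 for background), and the regression layer, 2048->4. They output class score predictions of shape [batch_size*128,num_classes] (softmax turns these into probabilities) and box offset predictions of shape [batch_size*128,4].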
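The shape flow through the head can be traced with stand-in layers. This is only a sketch: `layer4_stub` is a single stride-2 convolution standing in for resnet.layer4 (which is really a stack of bottleneck blocks), and `num_classes = 8` and the batch of 6 patches are assumed values chosen for the example.

```python
import torch
import torch.nn as nn

num_classes = 8                        # assumption: 7 object classes + 1 background
pooled_feat = torch.randn(6, 1024, 7, 7)   # stand-in for [batch_size*128, 1024, 7, 7]

# Hypothetical stand-in for resnet.layer4: one stride-2 conv, 1024 -> 2048 channels.
layer4_stub = nn.Conv2d(1024, 2048, kernel_size=1, stride=2)
cls_layer = nn.Linear(2048, num_classes)   # classifier branch
bbox_layer = nn.Linear(2048, 4)            # regression branch (4 offsets per roi)

with torch.no_grad():
    feat = layer4_stub(pooled_feat)    # [6, 2048, h, w] with a reduced spatial size
    vec = feat.mean(3).mean(2)         # global average pooling -> [6, 2048]
    cls_score = cls_layer(vec)         # raw class scores, no activation
    bbox_pred = bbox_layer(vec)        # raw offsets, no activation
```

Neither branch applies an activation: the scores go straight into the cross entropy loss and the offsets straight into the smooth L1 loss.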

bbox_pred = self.RCNN_bbox_pred(pooled_feat)

cls_score = self.RCNN_cls_score(pooled_feat)

The class_agnostic argument in trainval_net.py is declared with action store_true: if --cag appears on the python command line, args.class_agnostic=True; if it does not appear, args.class_agnostic=False.
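The store_true behaviour can be reproduced with a few lines of argparse. This is a minimal sketch; only the `--cag` flag and the `class_agnostic` destination come from the text, and the help string is made up for the example.

```python
import argparse

parser = argparse.ArgumentParser(description="store_true demo")
# store_true: the attribute defaults to False and becomes True when the flag is present.
parser.add_argument('--cag', dest='class_agnostic', action='store_true',
                    help='hypothetical help text: use class-agnostic box regression')

args_without_flag = parser.parse_args([])          # simulates: python trainval_net.py
args_with_flag = parser.parse_args(['--cag'])      # simulates: python trainval_net.py --cag
```

Passing an explicit list to `parse_args` makes the two cases easy to compare without touching the real command line.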

https://www.jianshu.com/p/fef2d215b91d

VI. Computing the Fast R-CNN loss

RCNN_loss_cls = F.cross_entropy(cls_score, rois_label)

RCNN_loss_bbox = _smooth_l1_loss(bbox_pred, rois_target, rois_inside_ws, rois_outside_ws)

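Classification again uses the cross entropy loss, averaged over the classification results of the batch size*128 region proposals.

Regression is again computed on the positive samples only, but the average still uses batch size*128 as the denominator, and no extra weight is applied as it is in the RPN.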

The torch.gather function is used here.
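The post does not show where torch.gather is called. One typical use in a detection head, assumed here for illustration, is picking the 4 offsets that belong to each RoI's ground truth class when the regression layer predicts 4 values per class. All tensor names and sizes below are hypothetical.

```python
import torch

num_rois, num_classes = 3, 5

# Hypothetical per-class predictions: 4 offsets for each of the 5 classes.
bbox_pred = torch.arange(num_rois * num_classes * 4,
                         dtype=torch.float32).view(num_rois, num_classes * 4)
rois_label = torch.tensor([0, 2, 4])            # ground truth class of each roi

# View as [num_rois, num_classes, 4], then gather along the class dimension.
pred_view = bbox_pred.view(num_rois, num_classes, 4)
index = rois_label.view(num_rois, 1, 1).expand(num_rois, 1, 4)
selected = torch.gather(pred_view, 1, index).squeeze(1)   # [num_rois, 4]
```

`torch.gather(input, dim, index)` reads `input[i][index[i][j][k]][k]` for `dim=1`, so expanding the label to shape [num_rois, 1, 4] selects one whole row of 4 offsets per RoI.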

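Here is a small debugging detail worth sharing: given a variable of type torch.tensor, how do we find out how many distinct values the tensor contains?

This is what I do, converting the tensor to a list and passing it to set: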

set(rois_outside_ws.view(-1).cpu().numpy().tolist())
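For comparison, the same check can be done without leaving PyTorch by using torch.unique. The sketch below uses a small made-up weight tensor in place of the real rois_outside_ws.

```python
import torch

# Made-up stand-in for rois_outside_ws: two positive rows and one negative row.
rois_outside_ws = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                                [0.0, 0.0, 0.0, 0.0],
                                [1.0, 1.0, 1.0, 1.0]])

# The set-based approach from the text (tolist() works directly on a CPU tensor).
values_set = set(rois_outside_ws.view(-1).cpu().tolist())

# torch.unique returns the sorted distinct values and, optionally, their counts.
values, counts = torch.unique(rois_outside_ws, return_counts=True)
```

`return_counts=True` additionally shows how many entries take each value, which is handy for checking the ratio of positives to negatives.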
