Faster-RCNN In Depth + Debug-Level Source Code Walkthrough Series (3): Training

吸欧大王

Category: Object Detection
Tags: faster-rcnn, object detection, deep learning, pytorch

Article link: https://blog.csdn.net/weixin_36714575/article/details/115110506

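Preface

In the previous articles we stepped through how the RPN generates ROIs, covering the first round of box-coordinate refinement, and how the Classifier head produces the final result, covering the second round of refinement. Together these form the core of a two-stage detector. But do the ROIs play any part in the RPN loss? How is the ratio of positive to negative samples controlled when the loss is computed? And how is the loss itself calculated? In this article we work through these questions line by line in the debugger.
If you have not read the first two articles of this Faster-RCNN series, they are a useful starting point: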

1. The RPN network and bbox regression

2. Regressing the ROIs with the Classifier

Debugging the source code

We focus on train.py to go through the details of training.
We start from the main block:

if __name__ == "__main__":
    #--- some code omitted ---#
    model_dict = model.state_dict()
    pretrained_dict = torch.load(model_path, map_location=device)
    pretrained_dict = {k: v for k, v in pretrained_dict.items() if np.shape(model_dict[k]) ==  np.shape(v)}
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict)

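This is a small sanity check: a parameter from the weights file is loaded only if its shape matches the corresponding parameter of the current network, which amounts to a non-strict loading strategy. Note that model_dict[k] assumes every key in the weights file also exists in the model. The pattern is worth reusing in your own projects.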
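A slightly more defensive variant of this shape check can be sketched as follows. This is only an illustration: NumPy arrays stand in for tensors, the key names and shapes are made up, and filter_matching is a hypothetical helper that is not part of the repository. Unlike the one-liner above, it also skips keys that the model does not have, which would otherwise raise a KeyError.

```python
import numpy as np

def filter_matching(pretrained_dict, model_dict):
    """Keep only entries whose key exists in the model and whose shape matches."""
    return {
        k: v
        for k, v in pretrained_dict.items()
        if k in model_dict and np.shape(model_dict[k]) == np.shape(v)
    }

# Toy stand-ins for state dicts (hypothetical names and shapes).
model_dict = {
    "conv.weight": np.zeros((8, 3, 3, 3)),
    "fc.weight": np.zeros((21, 128)),
}
pretrained = {
    "conv.weight": np.ones((8, 3, 3, 3)),  # same shape: kept
    "fc.weight": np.ones((1000, 128)),     # different class count: skipped
    "extra.bias": np.ones((4,)),           # key unknown to the model: skipped
}

kept = filter_matching(pretrained, model_dict)
model_dict.update(kept)  # the real code then calls model.load_state_dict(model_dict)
```

Parameters that are filtered out simply keep their freshly initialized values, which is why this is useful when the number of classes differs from the pretrained checkpoint.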

    if Cuda:
        net = torch.nn.DataParallel(model)
        cudnn.benchmark = True
        net = net.cuda()

This is the simplest way to get multi-GPU data parallelism.

    annotation_path = '2007_train.txt'
    #----------------------------------------------------------------------#
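    #   The validation split is made inside train.py.
    #   It is normal for 2007_test.txt and 2007_val.txt to be empty; training does not use them.
    #   With the current split, the validation:training ratio is 1:9.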
    #----------------------------------------------------------------------#
    val_split = 0.1
    with open(annotation_path) as f:
        lines = f.readlines()
    np.random.seed(10101)
    np.random.shuffle(lines)
    np.random.seed(None)
    num_val = int(len(lines)*val_split)
    num_train = len(lines) - num_val

This splits the annotation lines into a training set and a validation set.
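The same idea, a reproducible shuffle followed by a 9:1 cut, can be sketched on its own. This is a minimal illustration that swaps in the standard-library random module for np.random, so it will not reproduce the exact ordering of the original code; split_lines and the file names are hypothetical.

```python
import random

def split_lines(lines, val_split=0.1, seed=10101):
    """Shuffle a copy of `lines` reproducibly, then use the tail as validation data."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # private RNG, global random state untouched
    num_val = int(len(shuffled) * val_split)
    num_train = len(shuffled) - num_val
    return shuffled[:num_train], shuffled[num_train:]

train, val = split_lines([f"img_{i}.jpg" for i in range(25)])
```

Using a dedicated Random instance plays the role of the seed(10101) / seed(None) pair in the original: the split is fixed across runs without pinning the randomness used elsewhere.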
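Training as a whole has two phases. In the first, the backbone weights are frozen, which speeds up training and keeps the pretrained weights from being disrupted early on. In the second, the backbone is unfrozen and the whole network is fine-tuned.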

    if True:
        lr = 1e-4
        Batch_size = 2
        Init_Epoch = 0
        Freeze_Epoch = 50
        
        optimizer = optim.Adam(net.parameters(), lr, weight_decay=5e-4)
        lr_scheduler = optim.lr_scheduler.StepLR(optimizer,step_size=1,gamma=0.95)

        train_dataset = FRCNNDataset(lines[:num_train], (input_shape[0], input_shape[1]), is_train=True)
        val_dataset   = FRCNNDataset(lines[num_train:], (input_shape[0], input_shape[1]), is_train=False)
        gen     = DataLoader(train_dataset, shuffle=True, batch_size=Batch_size, num_workers=4, pin_memory=True,
                                drop_last=True, collate_fn=frcnn_dataset_collate)
        gen_val = DataLoader(val_dataset, shuffle=True, batch_size=Batch_size, num_workers=4, pin_memory=True,
                                drop_last=True, collate_fn=frcnn_dataset_collate)
                        
        epoch_size = num_train // Batch_size
        epoch_size_val = num_val // Batch_size
        # ------------------------------------#
        #   Train with part of the network frozen
        # ------------------------------------#
        for param in model.extractor.parameters():
            param.requires_grad = False

        # ------------------------------------#
        #   Freeze the BN layers
        # ------------------------------------#
        model.freeze_bn()

        train_util = FasterRCNNTrainer(model, optimizer)

        for epoch in range(Init_Epoch,Freeze_Epoch):
            fit_ont_epoch(net,epoch,epoch_size,epoch_size_val,gen,gen_val,Freeze_Epoch,Cuda)
            lr_scheduler.step()

    if True:
        lr = 1e-5
        Batch_size = 2
        Freeze_Epoch = 50
        Unfreeze_Epoch = 100

        optimizer = optim.Adam(net.parameters(), lr, weight_decay=5e-4)
        lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

        train_dataset = FRCNNDataset(lines[:num_train], (input_shape[0], input_shape[1]), is_train=True)
        val_dataset   = FRCNNDataset(lines[num_train:], (input_shape[0], input_shape[1]), is_train=False)
        gen     = DataLoader(train_dataset, shuffle=True, batch_size=Batch_size, num_workers=4, pin_memory=True,
                                drop_last=True, collate_fn=frcnn_dataset_collate)
        gen_val = DataLoader(val_dataset, shuffle=True, batch_size=Batch_size, num_workers=4, pin_memory=True,
                                drop_last=True, collate_fn=frcnn_dataset_collate)
                        
        epoch_size = num_train // Batch_size
        epoch_size_val = num_val // Batch_size
        #------------------------------------#
        #   Train after unfreezing
        #------------------------------------#
        for param in model.extractor.parameters():
            param.requires_grad = True

        # ------------------------------------#
        #   Freeze the BN layers
        # ------------------------------------#
        model.freeze_bn()

        train_util = FasterRCNNTrainer(model,optimizer)

        for epoch in range(Freeze_Epoch,Unfreeze_Epoch):
            fit_ont_epoch(net,epoch,epoch_size,epoch_size_val,gen,gen_val,Unfreeze_Epoch,Cuda)
            lr_scheduler.step()
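To see what the two StepLR schedules above amount to, the learning rate used in each epoch can be written in closed form. A minimal sketch, assuming (as in the code above) that each phase builds a fresh optimizer and scheduler, so the decay restarts at the phase boundary; lr_at_epoch is a hypothetical helper.

```python
def lr_at_epoch(epoch, freeze_epoch=50, lr_frozen=1e-4, lr_unfrozen=1e-5, gamma=0.95):
    """Learning rate used during `epoch` (0-based) under the two-phase schedule."""
    if epoch < freeze_epoch:
        # Frozen phase: starts at lr_frozen and is multiplied by gamma after every epoch.
        return lr_frozen * gamma ** epoch
    # Unfrozen phase: a new scheduler starts again from lr_unfrozen.
    return lr_unfrozen * gamma ** (epoch - freeze_epoch)

first_phase_end = lr_at_epoch(49)     # last epoch of the frozen phase
second_phase_start = lr_at_epoch(50)  # first epoch after unfreezing
```

With these settings the rate at the end of the frozen phase has already decayed to roughly 8.1e-6, slightly below the 1e-5 that the unfrozen phase starts from.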

The training code is fairly formulaic. The main work happens in fit_ont_epoch(), while the loss computation itself is done by forward(), which is called from train_util.train_step():

losses = train_util.train_step(imgs, boxes, labels, 1)

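Assuming the batch size is 2 and the input images are 800×800, the inputs to this function have the following shapes:
imgs: (2, 3, 800, 800)
boxes: a list with one element per image, each holding that image's bbox coordinates, for example shapes (6, 4) and (8, 4).
labels: a list with one element per image, each holding that image's bbox classes, for example shapes (6,) and (8,).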

    def train_step(self, imgs, bboxes, labels, scale):
        self.optimizer.zero_grad()
        losses = self.forward(imgs, bboxes, labels, scale)
        losses.total_loss.backward()
        self.optimizer.step()
        return losses

train_step() calls the forward() function of the FasterRCNNTrainer class:

    def forward(self, imgs, bboxes, labels, scale):
        n = imgs.shape[0]
        img_size = imgs.shape[2:]
        
        # Get the shared feature map
        base_feature = self.faster_rcnn.extractor(imgs)

        # Use the RPN to get the scores and adjustment parameters for the anchors
        rpn_locs, rpn_scores, rois, roi_indices, anchor = self.faster_rcnn.rpn(base_feature, img_size, scale)

        rpn_loc_loss_all, rpn_cls_loss_all, roi_loc_loss_all, roi_cls_loss_all = 0, 0, 0, 0
        for i in range(n):
            bbox = bboxes[i]
            label = labels[i]
            rpn_loc = rpn_locs[i]
            rpn_score = rpn_scores[i]
            roi = rois[roi_indices==i]
            feature = base_feature[i]

            # -------------------------------------------------- #
            #   Use the gt boxes and the anchors to build the targets the RPN should predict
            #   Every anchor is given a label
            #   gt_rpn_loc      [num_anchors, 4]
            #   gt_rpn_label    [num_anchors, ]
            # -------------------------------------------------- #
            gt_rpn_loc, gt_rpn_label = self.anchor_target_creator(bbox, anchor, img_size)
            gt_rpn_loc = torch.Tensor(gt_rpn_loc)
            gt_rpn_label = torch.Tensor(gt_rpn_label).long()

            if rpn_loc.is_cuda:
                gt_rpn_loc = gt_rpn_loc.cuda()
                gt_rpn_label = gt_rpn_label.cuda()

            # -------------------------------------------------- #
            #   Compute the RPN regression loss and classification loss
            # -------------------------------------------------- #
            rpn_loc_loss = _fast_rcnn_loc_loss(rpn_loc, gt_rpn_loc, gt_rpn_label, self.rpn_sigma)
            rpn_cls_loss = F.cross_entropy(rpn_score, gt_rpn_label, ignore_index=-1)
  
            # ------------------------------------------------------ #
            #   Use the gt boxes and the proposals to build the targets the classifier should predict
            #   This yields three variables: sample_roi, gt_roi_loc, gt_roi_label
            #   sample_roi      [n_sample, 4]
            #   gt_roi_loc      [n_sample, 4]
            #   gt_roi_label    [n_sample, ]
            # ------------------------------------------------------ #
            sample_roi, gt_roi_loc, gt_roi_label = self.proposal_target_creator(roi, bbox, label, self.loc_normalize_mean, self.loc_normalize_std)
            sample_roi = torch.Tensor(sample_roi)
            gt_roi_loc = torch.Tensor(gt_roi_loc)
            gt_roi_label = torch.Tensor(gt_roi_label).long()
            sample_roi_index = torch.zeros(len(sample_roi))
            
            if feature.is_cuda:
                sample_roi = sample_roi.cuda()
                sample_roi_index = sample_roi_index.cuda()
                gt_roi_loc = gt_roi_loc.cuda()
                gt_roi_label = gt_roi_label.cuda()

            roi_cls_loc, roi_score = self.faster_rcnn.head(torch.unsqueeze(feature, 0), sample_roi, sample_roi_index, img_size)

            # ------------------------------------------------------ #
            #   Pick the regression output that corresponds to each proposal's class
            # ------------------------------------------------------ #
            n_sample = roi_cls_loc.size()[1]
            roi_cls_loc = roi_cls_loc.view(n_sample, -1, 4)
            roi_loc = roi_cls_loc[torch.arange(0, n_sample), gt_roi_label]

            # -------------------------------------------------- #
            #   Compute the Classifier regression loss and classification loss
            # -------------------------------------------------- #
            roi_loc_loss = _fast_rcnn_loc_loss(roi_loc, gt_roi_loc, gt_roi_label.data, self.roi_sigma)
            roi_cls_loss = nn.CrossEntropyLoss()(roi_score[0], gt_roi_label)

            rpn_loc_loss_all += rpn_loc_loss
            rpn_cls_loss_all += rpn_cls_loss
            roi_loc_loss_all += roi_loc_loss
            roi_cls_loss_all += roi_cls_loss
            
        losses = [rpn_loc_loss_all/n, rpn_cls_loss_all/n, roi_loc_loss_all/n, roi_cls_loss_all/n]
        losses = losses + [sum(losses)]
        return LossTuple(*losses)

Now let's look at this forward() function in detail:

   base_feature = self.faster_rcnn.extractor(imgs)

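The first call is Faster-RCNN's extractor, that is, ResNet-50 up to and including stage 3, used as the backbone feature extractor. With a batch size of 2 and 800×800 inputs, the extracted feature map has shape (2, 1024, 50, 50).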
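The 50×50 grid, and the anchor count of 22500 that appears later, follow from simple arithmetic. A minimal sketch, assuming a total backbone stride of 16 (consistent with 800 shrinking to 50), 9 anchors per feature-map cell, and an input size divisible by the stride; rpn_grid is a hypothetical helper.

```python
def rpn_grid(img_size=800, stride=16, anchors_per_cell=9):
    """Return (feature map side length, total number of anchors) for a square input."""
    if img_size % stride != 0:
        # Real conv/pool layers round in ways this sketch does not model.
        raise ValueError("this sketch only covers sizes divisible by the stride")
    feat = img_size // stride
    return feat, feat * feat * anchors_per_cell

feat_side, num_anchors = rpn_grid()
```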

rpn_locs, rpn_scores, rois, roi_indices, anchor = self.faster_rcnn.rpn(base_feature, img_size, scale)

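Next comes Faster-RCNN's RPN. Following the analysis in the first article of this series, we annotate its outputs again:

rpn_locs: all localization offsets predicted by the RPN, shape (B, 50*50*9, 4). These are offsets relative to the anchors, and the anchors are laid out at original-image scale.
rpn_scores: the objectness scores predicted by the RPN, shape (B, 50*50*9, 2).
rois: the ROIs left after sorting by score, thresholding, NMS and a second round of filtering, shape (600*B, 4).
roi_indices: the index of the image in the batch that each ROI belongs to, shape (600*B,).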

        for i in range(n):
            bbox = bboxes[i]
            label = labels[i]
            rpn_loc = rpn_locs[i]
            rpn_score = rpn_scores[i]
            roi = rois[roi_indices==i]
            feature = base_feature[i]

            # -------------------------------------------------- #
            #   Use the gt boxes and the anchors to build the targets the RPN should predict
            #   Every anchor is given a label
            #   gt_rpn_loc      [num_anchors, 4]
            #   gt_rpn_label    [num_anchors, ]
            # -------------------------------------------------- #
            gt_rpn_loc, gt_rpn_label = self.anchor_target_creator(bbox, anchor, img_size)
            gt_rpn_loc = torch.Tensor(gt_rpn_loc)
            gt_rpn_label = torch.Tensor(gt_rpn_label).long()

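We loop over n, the batch dimension. Each iteration takes one image's ground-truth box coordinates (bbox) and the matching labels (label), the RPN scores and loc offsets for that image (rpn_score and rpn_loc), the ROIs for that image (roi), and its feature map (feature).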

gt_rpn_loc, gt_rpn_label = self.anchor_target_creator(bbox, anchor, img_size)

This calls anchor_target_creator(), whose purpose is to label the RPN anchors from the ground truth. Under the hood it invokes the __call__() method of the AnchorTargetCreator class.

class AnchorTargetCreator(object):
    def __init__(self, n_sample=256, pos_iou_thresh=0.7, neg_iou_thresh=0.3, pos_ratio=0.5):
        self.n_sample = n_sample
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh = neg_iou_thresh
        self.pos_ratio = pos_ratio

    def __call__(self, bbox, anchor, img_size):
        argmax_ious, label = self._create_label(anchor, bbox)
        if (label>0).any():
            loc = bbox2loc(anchor, bbox[argmax_ious])
            return loc, label
        else:
            return np.zeros_like(anchor), label

Inside __call__(), the first thing called is _create_label():

    def _create_label(self, anchor, bbox):
        # ------------------------------------------ #
        #   1 is a positive sample, 0 is a negative sample, -1 is ignored
        #   Everything is initialized to -1
        # ------------------------------------------ #
        label = np.empty((len(anchor),), dtype=np.int32)
        label.fill(-1)

        # ------------------------------------------------------------------------ #
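        #   argmax_ious: index of the gt box with the highest IoU for each anchor     [num_anchors, ]
        #   max_ious: the value of that highest IoU for each anchor                   [num_anchors, ]
        #   gt_argmax_ious: index of the anchor with the highest IoU for each gt box  [num_gt, ]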
        # ------------------------------------------------------------------------ #
        argmax_ious, max_ious, gt_argmax_ious = self._calc_ious(anchor, bbox)
        
        # ----------------------------------------------------- #
        #   Below the lower threshold: negative sample
        #   At or above the upper threshold: positive sample
        #   Every gt box gets at least one anchor
        # ----------------------------------------------------- #
        label[max_ious < self.neg_iou_thresh] = 0
        label[max_ious >= self.pos_iou_thresh] = 1
        if len(gt_argmax_ious)>0:
            label[gt_argmax_ious] = 1

        # ----------------------------------------------------- #
        #   If there are more than 128 positives, cap them at 128
        # ----------------------------------------------------- #
        n_pos = int(self.pos_ratio * self.n_sample)
        pos_index = np.where(label == 1)[0]
        if len(pos_index) > n_pos:
            disable_index = np.random.choice(pos_index, size=(len(pos_index) - n_pos), replace=False)
            label[disable_index] = -1

        # ----------------------------------------------------- #
        #   Balance positives and negatives so the total is 256
        # ----------------------------------------------------- #
        n_neg = self.n_sample - np.sum(label == 1)
        neg_index = np.where(label == 0)[0]
        if len(neg_index) > n_neg:
            disable_index = np.random.choice(neg_index, size=(len(neg_index) - n_neg), replace=False)
            label[disable_index] = -1

        return argmax_ious, label

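The input bbox holds the ground-truth boxes; if the image has only one gt box, its shape is (1, 4).
anchor holds the 9 anchors tiled uniformly over the image, with shape (22500, 4).
From these two inputs we can label every anchor position as containing an object or not. The first step is to compute the IoU between every anchor position and the gt boxes.

argmax_ious, max_ious, gt_argmax_ious = self._calc_ious(anchor, bbox)

This is done by another function, _calc_ious(), which computes the IoU.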

    def _calc_ious(self, anchor, bbox):
        #----------------------------------------------#
        #   IoU between anchor and bbox
        #   ious has shape [num_anchors, num_gt]
        #----------------------------------------------#
        ious = bbox_iou(anchor, bbox)

        if len(bbox)==0:
            return np.zeros(len(anchor), np.int32), np.zeros(len(anchor)), np.zeros(len(bbox))
        #---------------------------------------------------------#
        #   For each anchor, the best-matching gt box  [num_anchors, ]
        #---------------------------------------------------------#
        argmax_ious = ious.argmax(axis=1)
        #---------------------------------------------------------#
        #   For each anchor, the IoU with its best-matching gt box  [num_anchors, ]
        #---------------------------------------------------------#
        max_ious = np.max(ious, axis=1)

        #---------------------------------------------------------#
        #   For each gt box, the best-matching anchor  [num_gt, ]
        #---------------------------------------------------------#
        gt_argmax_ious = ious.argmax(axis=0)

        #---------------------------------------------------------#
        #   Make sure every gt box has a matching anchor
        #---------------------------------------------------------#
        for i in range(len(gt_argmax_ious)):
            argmax_ious[gt_argmax_ious[i]] = i

        return argmax_ious, max_ious, gt_argmax_ious

The core of the IoU computation is in bbox_iou():

def bbox_iou(bbox_a, bbox_b):
    if bbox_a.shape[1] != 4 or bbox_b.shape[1] != 4:
        print(bbox_a, bbox_b)
        raise IndexError
    tl = np.maximum(bbox_a[:, None, :2], bbox_b[:, :2])
    br = np.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:])
    area_i = np.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = np.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)
    area_b = np.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)
    return area_i / (area_a[:, None] + area_b - area_i)

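Let K be the number of GT boxes. The input bbox_a is the anchor array from before, (22500, 4), and bbox_b is the gt boxes, (K, 4).
Again we annotate the shape at each step:

tl: xmin, ymin of each intersection, shape (22500, K, 2).
br: xmax, ymax of each intersection, shape (22500, K, 2).
area_i: the intersection area of each anchor with every gt box, shape (22500, K).
area_a: the area of each anchor, shape (22500,).
area_b: the area of each gt box, shape (K,).
area_i / (area_a[:, None] + area_b - area_i) computes all the IoUs, shape (22500, K).
Back in _calc_ious(), ious has shape (22500, K) and records the IoU of every anchor position with all K gt boxes.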
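The broadcasting in bbox_iou() is easier to follow on a tiny input. The sketch below repeats the function from above (without the shape check) and runs it on three made-up anchors and two made-up gt boxes in (x1, y1, x2, y2) form, so the result has shape (3, 2) instead of (22500, K).

```python
import numpy as np

def bbox_iou(bbox_a, bbox_b):
    """Pairwise IoU between (N, 4) and (K, 4) boxes; the result has shape (N, K)."""
    tl = np.maximum(bbox_a[:, None, :2], bbox_b[:, :2])  # (N, K, 2) top-left of overlap
    br = np.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:])  # (N, K, 2) bottom-right of overlap
    # The boolean mask zeroes the area when the two boxes do not overlap.
    area_i = np.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = np.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)  # (N,)
    area_b = np.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)  # (K,)
    return area_i / (area_a[:, None] + area_b - area_i)

anchors = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30]], dtype=np.float32)
gts = np.array([[0, 0, 10, 10], [5, 0, 15, 10]], dtype=np.float32)
ious = bbox_iou(anchors, gts)
```

The first anchor coincides with the first gt box (IoU 1), while the third anchor touches neither gt box, so the mask turns the spurious positive product of two negative side lengths into 0.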

        argmax_ious = ious.argmax(axis=1)
        #---------------------------------------------------------#
        #   For each anchor, the IoU with its best-matching gt box  [num_anchors, ]
        #---------------------------------------------------------#
        max_ious = np.max(ious, axis=1)

        #---------------------------------------------------------#
        #   For each gt box, the best-matching anchor  [num_gt, ]
        #---------------------------------------------------------#
        gt_argmax_ious = ious.argmax(axis=0)

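argmax_ious: for each anchor position, the index of the gt box with the highest IoU, shape (22500,).
max_ious: for each anchor position, the value of that highest IoU, shape (22500,).
gt_argmax_ious: for each gt box, the index of the anchor position with the highest IoU, shape (K,).

Having finished with _calc_ious(), we return to _create_label():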

       label = np.empty((len(anchor),), dtype=np.int32)
       label.fill(-1)
       argmax_ious, max_ious, gt_argmax_ious = self._calc_ious(anchor, bbox)
        
        # ----------------------------------------------------- #
        #   Below the lower threshold: negative sample
        #   At or above the upper threshold: positive sample
        #   Every gt box gets at least one anchor
        # ----------------------------------------------------- #
        label[max_ious < self.neg_iou_thresh] = 0
        label[max_ious >= self.pos_iou_thresh] = 1
        if len(gt_argmax_ious)>0:
            label[gt_argmax_ious] = 1

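label is initialized with one entry per anchor position, i.e. shape (22500,), with every entry set to -1.
Positives and negatives are then decided from each anchor position's maximum IoU together with neg_iou_thresh and pos_iou_thresh.
neg_iou_thresh defaults to 0.3 and pos_iou_thresh defaults to 0.7.
So the labeling rule can be stated as follows:

An anchor position whose IoU with any gt box is at least 0.7 is labeled positive; an anchor position whose IoU with every gt box is below 0.3 is labeled negative.
In addition, for each gt box, the anchor position with the highest IoU is labeled positive.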
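The labeling rule can be tried out on a hand-written IoU matrix. A minimal sketch of the thresholding part of _create_label() only, before any subsampling; assign_labels is a hypothetical helper and the IoU values are invented for illustration.

```python
import numpy as np

def assign_labels(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors from a (num_anchors, num_gt) IoU matrix: 1 pos, 0 neg, -1 ignore."""
    label = np.full(ious.shape[0], -1, dtype=np.int32)
    max_ious = ious.max(axis=1)
    label[max_ious < neg_thresh] = 0
    label[max_ious >= pos_thresh] = 1
    label[ious.argmax(axis=0)] = 1  # the best anchor of each gt box is always positive
    return label

ious = np.array([
    [0.80, 0.10],  # at least 0.7 with gt 0       -> positive
    [0.50, 0.20],  # between the two thresholds   -> ignored
    [0.10, 0.45],  # best anchor for gt 1         -> positive although below 0.7
    [0.05, 0.00],  # below 0.3 for every gt box   -> negative
])
labels = assign_labels(ious)
```

The third row shows why the extra rule matters: without it, gt 1 would have no positive anchor at all, because none of its IoUs reaches 0.7.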

        # ----------------------------------------------------- #
        #   If there are more than 128 positives, cap them at 128
        # ----------------------------------------------------- #
        n_pos = int(self.pos_ratio * self.n_sample)
        pos_index = np.where(label == 1)[0]
        if len(pos_index) > n_pos:
            disable_index = np.random.choice(pos_index, size=(len(pos_index) - n_pos), replace=False)
            label[disable_index] = -1

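At most 128 positives are kept; the rest are set to -1, meaning anchor positions that are neither positive nor negative and are ignored.

        # ----------------------------------------------------- #
        #   Balance positives and negatives so the total is 256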
        # ----------------------------------------------------- #
        n_neg = self.n_sample - np.sum(label == 1)
        neg_index = np.where(label == 0)[0]
        if len(neg_index) > n_neg:
            disable_index = np.random.choice(neg_index, size=(len(neg_index) - n_neg), replace=False)
            label[disable_index] = -1

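Number of negatives = preset sample count - number of positives; surplus negatives are set to -1. So if there are at least 128 positives, positives and negatives are both 128; if there are fewer than 128 positives, the remainder of the 256 is filled with negatives.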
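The resulting sample counts depend only on how many candidates of each kind exist. A minimal sketch of that bookkeeping, following the two subsampling steps above; rpn_sample_counts is a hypothetical helper and it ignores which anchors are picked at random.

```python
def rpn_sample_counts(num_pos, num_neg, n_sample=256, pos_ratio=0.5):
    """Positives and negatives that survive subsampling in AnchorTargetCreator."""
    max_pos = int(pos_ratio * n_sample)           # 128 with the defaults
    kept_pos = min(num_pos, max_pos)              # surplus positives become -1
    kept_neg = min(num_neg, n_sample - kept_pos)  # negatives fill the remainder
    return kept_pos, kept_neg

plenty = rpn_sample_counts(300, 20000)  # more positives than the cap
typical = rpn_sample_counts(10, 20000)  # few positives, many negatives
scarce = rpn_sample_counts(10, 100)     # not enough negatives to reach 256
```

In practice positives are usually scarce, so the sampled set tends to be dominated by negatives rather than split evenly.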

This completes the label construction. Now we go back up to the caller:

    def __call__(self, bbox, anchor, img_size):
        argmax_ious, label = self._create_label(anchor, bbox)
        if (label>0).any():
            loc = bbox2loc(anchor, bbox[argmax_ious])
            return loc, label
        else:
            return np.zeros_like(anchor), label

It calls bbox2loc() to compute the coordinate offsets for every anchor position, which are used in the loss.

def bbox2loc(src_bbox, dst_bbox):
    width = src_bbox[:, 2] - src_bbox[:, 0]
    height = src_bbox[:, 3] - src_bbox[:, 1]
    ctr_x = src_bbox[:, 0] + 0.5 * width
    ctr_y = src_bbox[:, 1] + 0.5 * height

    base_width = dst_bbox[:, 2] - dst_bbox[:, 0]
    base_height = dst_bbox[:, 3] - dst_bbox[:, 1]
    base_ctr_x = dst_bbox[:, 0] + 0.5 * base_width
    base_ctr_y = dst_bbox[:, 1] + 0.5 * base_height

    eps = np.finfo(height.dtype).eps
    width = np.maximum(width, eps)
    height = np.maximum(height, eps)

    dx = (base_ctr_x - ctr_x) / width
    dy = (base_ctr_y - ctr_y) / height
    dw = np.log(base_width / width)
    dh = np.log(base_height / height)

    loc = np.vstack((dx, dy, dw, dh)).transpose()
    return loc

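src_bbox holds the anchor coordinates, shape (22500, 4); dst_bbox holds, for every anchor position, the coordinates of its max-IoU gt box, also shape (22500, 4). In other words, every anchor position comes with its anchor coordinates and the matching gt coordinates.
Both are then converted to the (x_center, y_center, w, h) representation.
The offsets are then computed as follows, where the subscript a denotes the anchor and the unsubscripted symbols denote the gt box:
$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$
These correspond to dx, dy, dw and dh in the code and are identical to the paper, so we will not go over them again.
The returned loc also has shape (22500, 4), and now holds offsets.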
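A single anchor and gt box make the encoding concrete. The sketch below is a scalar re-implementation for illustration, not the repository's vectorized code: box_to_loc mirrors bbox2loc() without the eps clamp, and loc_to_box is a hypothetical inverse added here to show that the encoding is reversible. Boxes are assumed to be (x1, y1, x2, y2).

```python
import math

def box_to_loc(anchor, gt):
    """Encode a gt box relative to an anchor as (dx, dy, dw, dh)."""
    ax1, ay1, ax2, ay2 = anchor
    gx1, gy1, gx2, gy2 = gt
    aw, ah = ax2 - ax1, ay2 - ay1
    gw, gh = gx2 - gx1, gy2 - gy1
    dx = ((gx1 + 0.5 * gw) - (ax1 + 0.5 * aw)) / aw  # center shift in anchor widths
    dy = ((gy1 + 0.5 * gh) - (ay1 + 0.5 * ah)) / ah  # center shift in anchor heights
    return dx, dy, math.log(gw / aw), math.log(gh / ah)

def loc_to_box(anchor, loc):
    """Decode (dx, dy, dw, dh) back into a box, given the same anchor."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    dx, dy, dw, dh = loc
    cx, cy = ax1 + 0.5 * aw + dx * aw, ay1 + 0.5 * ah + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h

anchor = (0.0, 0.0, 10.0, 10.0)
gt = (5.0, 5.0, 25.0, 15.0)
loc = box_to_loc(anchor, gt)    # center moves by (1.0, 0.5) anchor sizes, width doubles
decoded = loc_to_box(anchor, loc)
```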

We return to the forward() function of the FasterRCNNTrainer class:

gt_rpn_loc, gt_rpn_label = self.anchor_target_creator(bbox, anchor, img_size)

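The gt_rpn_loc obtained here holds the gt-to-anchor offsets described above for all anchor positions, shape (22500, 4); gt_rpn_label marks each anchor position as a positive, negative or ignored sample after the IoU computation, shape (22500,).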

 rpn_loc_loss = _fast_rcnn_loc_loss(rpn_loc, gt_rpn_loc, gt_rpn_label, self.rpn_sigma)
 rpn_cls_loss = F.cross_entropy(rpn_score, gt_rpn_label, ignore_index=-1)

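The RPN loss has two parts. One is the classification loss: the cross-entropy between the RPN's classification output and gt_rpn_label. ignore_index=-1 is set so that samples labeled as ignored take no part in the loss, which is equivalent to giving them a weight of zero; they are also left out of the average. This is a convenient way to restrict a cross-entropy loss to a chosen subset of samples.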
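What ignore_index does can be shown with a small pure-Python version of the computation. This is a sketch of the idea only, not PyTorch's implementation; cross_entropy_ignore is a hypothetical helper, and returning 0.0 when every target is ignored is a choice made for this sketch.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index=-1):
    """Mean cross-entropy over the rows whose target is not `ignore_index`."""
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == ignore_index:
            continue  # ignored anchors add nothing to the sum or to the count
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[t]  # equals -log softmax(row)[t]
        count += 1
    return total / count if count else 0.0

logits = [[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]
targets = [0, 0, -1]  # the third anchor is an ignored sample
loss = cross_entropy_ignore(logits, targets)
```

Changing the logits of the third row leaves the loss untouched, which is exactly the behaviour wanted for anchors that are neither positive nor negative.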
The other is the location regression loss, computed by the _fast_rcnn_loc_loss function:

def _fast_rcnn_loc_loss(pred_loc, gt_loc, gt_label, sigma):
    pred_loc = pred_loc[gt_label>0]
    gt_loc = gt_loc[gt_label>0]

    loc_loss = _smooth_l1_loss(pred_loc, gt_loc, sigma)
    num_pos = (gt_label > 0).sum().float()
    loc_loss /= torch.max(num_pos, torch.ones_like(num_pos))
    return loc_loss

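Again we annotate the shapes and meanings of the arguments:

pred_loc: the RPN's predicted location offsets, shape (22500, 4).
gt_loc: the computed offsets between every anchor position and its max-IoU gt box, shape (22500, 4).
gt_label: the label of every anchor position: 0 for a negative sample, 1 for a positive sample, -1 for an ignored sample. Shape (22500,).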

def _smooth_l1_loss(x, t, sigma):
    sigma_squared = sigma ** 2
    regression_diff = (x - t)
    regression_diff = regression_diff.abs()
    regression_loss = torch.where(
            regression_diff < (1. / sigma_squared),
            0.5 * sigma_squared * regression_diff ** 2,
            regression_diff - 0.5 / sigma_squared
        )
    return regression_loss.sum()

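With x the difference between prediction and target, the code above implements:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < 1/\sigma^2 \\ |x| - 0.5/\sigma^2, & \text{otherwise} \end{cases}$$
The difference between smooth L1 and the plain L1 loss is that L1 is not differentiable at 0, which can hurt convergence. Smooth L1 addresses this by using a quadratic near 0, which makes the function smooth there.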
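The two branches, and the normalization by the number of positives, can be checked with plain numbers. A minimal pure-Python sketch that follows the logic of _smooth_l1_loss() and _fast_rcnn_loc_loss() above; smooth_l1 and loc_loss are hypothetical scalar/list versions written for illustration, and the input values are invented.

```python
def smooth_l1(diff, sigma):
    """Smooth L1 of a single difference, with the sigma-controlled switch point."""
    s2 = sigma ** 2
    d = abs(diff)
    if d < 1.0 / s2:
        return 0.5 * s2 * d ** 2  # quadratic near zero
    return d - 0.5 / s2           # linear further out

def loc_loss(pred_loc, gt_loc, gt_label, sigma):
    """Sum smooth L1 over positive rows only, divided by max(num_pos, 1)."""
    total, num_pos = 0.0, 0
    for p_row, g_row, lab in zip(pred_loc, gt_loc, gt_label):
        if lab > 0:  # negatives (0) and ignored samples (-1) are skipped
            total += sum(smooth_l1(p - g, sigma) for p, g in zip(p_row, g_row))
            num_pos += 1
    return total / max(num_pos, 1)

pred = [[0.5, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
gt = [[0.0] * 4, [0.0] * 4, [0.0] * 4]
loss = loc_loss(pred, gt, [1, 1, 0], sigma=1.0)  # third row is negative, so it is skipped
```

A larger sigma narrows the quadratic region to |x| < 1/sigma^2, so the loss behaves like plain L1 over more of its range.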

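That is the loss computation for stage 1, the RPN. To summarize:

First, be clear that in stage 1 neither the sample assignment nor the loss computation has anything to do with the ROIs; do not mix the two up.
For every anchor position on the feature map, the IoU with every GT box is computed. Anchors whose maximum IoU is at least 0.7 are labeled positive (1), as is the best-matching anchor of each GT box; anchors whose maximum IoU is below 0.3 are labeled negative (0); the rest are labeled ignored (-1).
At most 128 positives are sampled, and the surplus positives are relabeled as ignored. If there are 128 or more positives, 128 negatives are sampled and the remaining negatives are ignored. If there are fewer than 128 positives, negatives fill the remainder, that is, n_sample (256) minus the number of positives. So each image yields 256 samples, provided enough negatives are available.
The RPN loss has two parts. One is the 2-class classification loss, computed over the 256 positive and negative samples just obtained. The other is the location regression: the offsets of the gt relative to the anchors are computed and regressed against the RPN's location output. Two points to note here: only anchor positions labeled positive contribute, and the loss is smooth L1.

Next we look at the loss computation for stage 2: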

            sample_roi, gt_roi_loc, gt_roi_label = self.proposal_target_creator(roi, bbox, label, self.loc_normalize_mean, self.loc_normalize_std)
            sample_roi = torch.Tensor(sample_roi)
            gt_roi_loc = torch.Tensor(gt_roi_loc)
            gt_roi_label = torch.Tensor(gt_roi_label).long()
            sample_roi_index = torch.zeros(len(sample_roi))
            
            if feature.is_cuda:
                sample_roi = sample_roi.cuda()
                sample_roi_index = sample_roi_index.cuda()
                gt_roi_loc = gt_roi_loc.cuda()
                gt_roi_label = gt_roi_label.cuda()

The key piece here is what the ProposalTargetCreator class does, so let's look at it closely:

    def __init__(self, n_sample=128, pos_ratio=0.5, pos_iou_thresh=0.5, neg_iou_thresh_high=0.5, neg_iou_thresh_low=0):
        self.n_sample = n_sample
        self.pos_ratio = pos_ratio
        self.pos_roi_per_image = np.round(self.n_sample * self.pos_ratio)
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh_high = neg_iou_thresh_high
        self.neg_iou_thresh_low = neg_iou_thresh_low

The important part is the __call__ function below:

   def __call__(self, roi, bbox, label, loc_normalize_mean=(0., 0., 0., 0.), loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):
        roi = np.concatenate((roi.detach().cpu().numpy(), bbox), axis=0)
        # ----------------------------------------------------- #
        #   Compute the overlap between the proposals and the gt boxes
        # ----------------------------------------------------- #
        iou = bbox_iou(roi, bbox)
        
        if len(bbox)==0:
            gt_assignment = np.zeros(len(roi), np.int32)
            max_iou = np.zeros(len(roi))
            gt_roi_label = np.zeros(len(roi))
        else:
            #---------------------------------------------------------#
            #   For each proposal, the best-matching gt box  [num_roi, ]
            #---------------------------------------------------------#
            gt_assignment = iou.argmax(axis=1)
            #---------------------------------------------------------#
            #   For each proposal, the IoU with its best-matching gt box  [num_roi, ]
            #---------------------------------------------------------#
            max_iou = iou.max(axis=1)

            #---------------------------------------------------------#
            #   gt labels are shifted by +1 because 0 is reserved for background
            #---------------------------------------------------------#
            gt_roi_label = label[gt_assignment] + 1

        #----------------------------------------------------------------#
        #   Proposals whose overlap with a gt box is at least pos_iou_thresh are positive samples
        #   The number of positives is capped at self.pos_roi_per_image
        #----------------------------------------------------------------#
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]
        pos_roi_per_this_image = int(min(self.pos_roi_per_image, pos_index.size))
        if pos_index.size > 0:
            pos_index = np.random.choice(pos_index, size=pos_roi_per_this_image, replace=False)

        #-----------------------------------------------------------------------------------------------------#
        #   Proposals whose overlap is below neg_iou_thresh_high and at least neg_iou_thresh_low are negative samples
        #   The total number of positives and negatives is brought to self.n_sample where possible
        #-----------------------------------------------------------------------------------------------------#
        neg_index = np.where((max_iou < self.neg_iou_thresh_high) & (max_iou >= self.neg_iou_thresh_low))[0]
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image
        neg_roi_per_this_image = int(min(neg_roi_per_this_image, neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(neg_index, size=neg_roi_per_this_image, replace=False)

        #---------------------------------------------------------#
        #   sample_roi      [n_sample, 4]
        #   gt_roi_loc      [n_sample, 4]
        #   gt_roi_label    [n_sample, ]
        #---------------------------------------------------------#
        keep_index = np.append(pos_index, neg_index)

        sample_roi = roi[keep_index]
        if len(bbox)==0:
            return sample_roi, np.zeros_like(sample_roi), gt_roi_label[keep_index]

        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])
        gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)) / np.array(loc_normalize_std, np.float32))

        gt_roi_label = gt_roi_label[keep_index]
        gt_roi_label[pos_roi_per_this_image:] = 0
        return sample_roi, gt_roi_loc, gt_roi_label

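First, we annotate the input arguments:

roi: the RPN output after both rounds of filtering, shape [600, 4]. The first line of __call__ appends the gt boxes to it, so inside the function it becomes [600 + K, 4].
bbox: the coordinates of the targets, shape [K, 4], where K is the number of targets in the image.
label: the labels of the targets, shape [K,].
loc_normalize_mean=(0., 0., 0., 0.), loc_normalize_std=(0.1, 0.1, 0.2, 0.2): the normalization parameters for the offsets.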

       iou = bbox_iou(roi, bbox)
        
        if len(bbox)==0:
            gt_assignment = np.zeros(len(roi), np.int32)
            max_iou = np.zeros(len(roi))
            gt_roi_label = np.zeros(len(roi))
        else:
            #---------------------------------------------------------#
            #   For each proposal, the best-matching gt box  [num_roi, ]
            #---------------------------------------------------------#
            gt_assignment = iou.argmax(axis=1)
            #---------------------------------------------------------#
            #   For each proposal, the IoU with its best-matching gt box  [num_roi, ]
            #---------------------------------------------------------#
            max_iou = iou.max(axis=1)

            #---------------------------------------------------------#
            #   gt labels are shifted by +1 because 0 is reserved for background
            #---------------------------------------------------------#
            gt_roi_label = label[gt_assignment] + 1

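We compute the IoU between the ROIs and the targets, which gives the IoU of every ROI with every target. Then, for each ROI, we find the target with the highest IoU and that target's label.
Note that the label is shifted by +1, because 0 now stands for background.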

        #----------------------------------------------------------------#
        #   Proposals whose overlap with a gt box is at least pos_iou_thresh are positive samples
        #   The number of positives is capped at self.pos_roi_per_image
        #----------------------------------------------------------------#
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]
        pos_roi_per_this_image = int(min(self.pos_roi_per_image, pos_index.size))
        if pos_index.size > 0:
            pos_index = np.random.choice(pos_index, size=pos_roi_per_this_image, replace=False)

ROIs whose maximum IoU is at least pos_iou_thresh (0.5 by default) are selected, and at most 64 of them are sampled as positives.

        #   Proposals whose overlap is below neg_iou_thresh_high and at least neg_iou_thresh_low are negative samples
        #   The total number of positives and negatives is brought to self.n_sample where possible
        #-----------------------------------------------------------------------------------------------------#
        neg_index = np.where((max_iou < self.neg_iou_thresh_high) & (max_iou >= self.neg_iou_thresh_low))[0]
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image
        neg_roi_per_this_image = int(min(neg_roi_per_this_image, neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(neg_index, size=neg_roi_per_this_image, replace=False)

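ROIs whose max_iou lies in [neg_iou_thresh_low, neg_iou_thresh_high) are negatives. Their number is n_sample minus the number of positives kept, so it is 64 by default when 64 positives are available.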
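As with the RPN, the counts can be worked out without any random sampling. A minimal sketch of the bookkeeping in ProposalTargetCreator; roi_sample_counts is a hypothetical helper that takes the number of candidate positives and negatives as inputs.

```python
def roi_sample_counts(num_pos, num_neg, n_sample=128, pos_ratio=0.5):
    """Positives and negatives kept by ProposalTargetCreator for one image."""
    pos_roi_per_image = round(n_sample * pos_ratio)     # 64 with the defaults
    kept_pos = int(min(pos_roi_per_image, num_pos))
    kept_neg = int(min(n_sample - kept_pos, num_neg))   # negatives fill what is left
    return kept_pos, kept_neg

balanced = roi_sample_counts(200, 500)     # enough of both kinds
few_pos = roi_sample_counts(10, 500)       # negatives make up the difference
short = roi_sample_counts(10, 50)          # fewer than 128 samples in total
```

The last case shows that 128 is an upper bound rather than a guarantee: when candidates run out, the sampled set is simply smaller.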

        keep_index = np.append(pos_index, neg_index)

        sample_roi = roi[keep_index]
        if len(bbox)==0:
            return sample_roi, np.zeros_like(sample_roi), gt_roi_label[keep_index]

After this processing, sample_roi has shape [128, 4], or fewer rows if there are not enough candidates.

        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])
        gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)) / np.array(loc_normalize_std, np.float32))

        gt_roi_label = gt_roi_label[keep_index]
        gt_roi_label[pos_roi_per_this_image:] = 0
        return sample_roi, gt_roi_loc, gt_roi_label

bbox2loc is called once more to get the offsets between the ROIs and the targets, which are then normalized.
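The normalization step is a per-coordinate shift and scale. A minimal sketch using the default mean and std from the signature above; normalize_loc and denormalize_loc are hypothetical helpers, the raw offsets are invented, and the inverse is included only to show what a decoder would have to undo.

```python
def normalize_loc(loc, mean=(0.0, 0.0, 0.0, 0.0), std=(0.1, 0.1, 0.2, 0.2)):
    """Scale raw (dx, dy, dw, dh) offsets into the range the head is trained on."""
    return [(v - m) / s for v, m, s in zip(loc, mean, std)]

def denormalize_loc(norm_loc, mean=(0.0, 0.0, 0.0, 0.0), std=(0.1, 0.1, 0.2, 0.2)):
    """Inverse of normalize_loc: recover raw offsets from normalized ones."""
    return [v * s + m for v, m, s in zip(norm_loc, mean, std)]

raw = [0.05, -0.02, 0.1, 0.0]
normalized = normalize_loc(raw)        # roughly [0.5, -0.2, 0.5, 0.0]
restored = denormalize_loc(normalized)
```

Dividing by a std of 0.1 or 0.2 enlarges the typically small offsets, so the regression targets are closer to unit scale.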

sample_roi, gt_roi_loc, gt_roi_label = self.proposal_target_creator(roi, bbox, label, self.loc_normalize_mean, self.loc_normalize_std)

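Back here, we annotate the outputs again:

sample_roi: the ROIs after positive/negative sampling, shape [128, 4].
gt_roi_loc: the offsets computed between the ROIs and the targets, shape [128, 4].
gt_roi_label: the labels of sample_roi, shape (128,).

Only now is the sampled input for the classification network ready. We feed sample_roi into the final classification network; the previous article has a full debug walkthrough of that network, so take a look if you have not seen it.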

            roi_cls_loc, roi_score = self.faster_rcnn.head(torch.unsqueeze(feature, 0), sample_roi, sample_roi_index, img_size)

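The two outputs are the location regression and the class scores, with shapes:

roi_cls_loc: shape [1, 128, 84]
roi_score: shape [1, 128, 21]
21 is the number of classes including background, with one score each; 84 = 21 × 4 holds the coordinate offsets for each of those classes.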

            n_sample = roi_cls_loc.size()[1]
            roi_cls_loc = roi_cls_loc.view(n_sample, -1, 4)
            roi_loc = roi_cls_loc[torch.arange(0, n_sample), gt_roi_label]

            # -------------------------------------------------- #
            #   Compute the Classifier regression loss and classification loss
            # -------------------------------------------------- #
            roi_loc_loss = _fast_rcnn_loc_loss(roi_loc, gt_roi_loc, gt_roi_label.data, self.roi_sigma)
            roi_cls_loss = nn.CrossEntropyLoss()(roi_score[0], gt_roi_label)
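The view-then-index step that picks one set of four offsets per ROI can be reproduced with NumPy, whose reshape and integer-array indexing behave like the torch calls used above. A minimal sketch with 3 ROIs instead of 128 and made-up values, chosen so that the selected entries are easy to recognize.

```python
import numpy as np

n_sample, n_class = 3, 21
# Stand-in for the head output, shape (1, n_sample, n_class * 4).
roi_cls_loc = np.arange(n_sample * n_class * 4, dtype=np.float32).reshape(1, n_sample, n_class * 4)
gt_roi_label = np.array([0, 5, 20])  # 0 is background, 1..20 are object classes

per_class = roi_cls_loc.reshape(n_sample, -1, 4)         # (3, 21, 4): one box per class
roi_loc = per_class[np.arange(n_sample), gt_roi_label]   # (3, 4): row i keeps class gt_roi_label[i]
```

Only the offsets predicted for the ground-truth class of each ROI reach the regression loss, and ROIs labeled as background (label 0) are then dropped by the gt_label > 0 mask.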

The stage 2 loss is computed as in the code above and is fairly simple. The class loss is a plain cross-entropy; for the location regression loss:

def _fast_rcnn_loc_loss(pred_loc, gt_loc, gt_label, sigma):
    pred_loc = pred_loc[gt_label>0]
    gt_loc = gt_loc[gt_label>0]

    loc_loss = _smooth_l1_loss(pred_loc, gt_loc, sigma)
    num_pos = (gt_label > 0).sum().float()
    loc_loss /= torch.max(num_pos, torch.ones_like(num_pos))
    return loc_loss

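This part, too, is a smooth L1 loss.

Summary

That completes our analysis of the Faster-RCNN training source. The process is evidently fairly involved. I recommend following this tutorial and debugging the whole thing yourself; I find that more efficient than rereading the paper several times or watching video tutorials. Also, because the details of Faster-RCNN are hard to keep in memory, I have broken the common questions down into knowledge points, collected in this summary post:

CNN knowledge points summary

Stepping through source code line by line takes real effort; if you found this useful, please like and bookmark to support the author. Thanks!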
