1. extract_feat
Extracts multi-scale feature maps from the input images.
The returned x is a tuple of 5 tensors, one feature map per scale; their shapes are:
x[0] | (2, 256, 16, 32)
x[1] | (2, 256, 8, 16)
x[2] | (2, 256, 4, 8)
x[3] | (2, 256, 2, 4)
x[4] | (2, 256, 1, 2)
Note: the first dimension is the batch_size; the second is the number of channels, which is set in the config; the third and fourth are H and W. H and W are not fixed, because the input image size is not fixed.
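For reference, extract_feat in two_stage.py is essentially just the backbone followed by the FPN neck (a minimal sketch of the mmdet 2.x implementation):

def extract_feat(self, img):
    """Directly extract features from the backbone + neck."""
    x = self.backbone(img)   # e.g. ResNet: multi-stage feature maps
    if self.with_neck:
        x = self.neck(x)     # FPN: 5 levels, each with 256 channels here
    return x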
2. RPN forward and loss
2.1 Compute the classification and regression outputs of the multi-scale feature maps
This happens in the forward_train function in base_dense_head.py:
def forward_train(self,
                  x,
                  img_metas,
                  gt_bboxes,
                  gt_labels=None,
                  gt_bboxes_ignore=None,
                  proposal_cfg=None,
                  **kwargs):
    """
    Args:
        x (list[Tensor]): Features from FPN.
        img_metas (list[dict]): Meta information of each image, e.g.,
            image size, scaling factor, etc.
        gt_bboxes (Tensor): Ground truth bboxes of the image,
            shape (num_gts, 4).
        gt_labels (Tensor): Ground truth labels of each box,
            shape (num_gts,).
        gt_bboxes_ignore (Tensor): Ground truth bboxes to be ignored,
            shape (num_ignored_gts, 4).
        proposal_cfg (mmcv.Config): Test / postprocessing configuration,
            if None, test_cfg would be used.

    Returns:
        tuple:
            losses (dict[str, Tensor]): A dictionary of loss components.
            proposal_list (list[Tensor]): Proposals of each image.
    """
    outs = self(x)
    # the RPN only does binary (object vs. background) classification,
    # so gt_labels is None here
    if gt_labels is None:
        loss_inputs = outs + (gt_bboxes, img_metas)
    else:
        loss_inputs = outs + (gt_bboxes, gt_labels, img_metas)
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
    if proposal_cfg is None:
        return losses
    else:
        proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
        return losses, proposal_list
Here self(x) calls base_dense_head's forward function, which calls anchor_head's forward; rpn_head in turn overrides forward_single. So ultimately, rpn_head's forward_single passes each FPN feature map through a shared convolution and two branches (classification and regression) to produce the outputs.
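For reference, forward_single in rpn_head.py (called once per FPN level) looks roughly like this in mmdet 2.x, where rpn_conv is the shared 3x3 convolution and rpn_cls / rpn_reg are the two 1x1 prediction branches:

def forward_single(self, x):
    """Forward a feature map of a single scale level."""
    x = self.rpn_conv(x)               # shared 3x3 conv
    x = F.relu(x, inplace=True)
    rpn_cls_score = self.rpn_cls(x)    # 1x1 conv -> num_anchors * cls_out_channels
    rpn_bbox_pred = self.rpn_reg(x)    # 1x1 conv -> num_anchors * 4
    return rpn_cls_score, rpn_bbox_pred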
outs contains 2 elements: the first is rpn_cls_score, the second is rpn_bbox_pred. Each of them is a list of 5 tensors, corresponding to the 5 feature-map scales in x.
Their shapes are:
rpn_cls_score              | rpn_bbox_pred
outs[0][0]: (2, 3, 16, 32) | outs[1][0]: (2, 12, 16, 32)
outs[0][1]: (2, 3, 8, 16)  | outs[1][1]: (2, 12, 8, 16)
outs[0][2]: (2, 3, 4, 8)   | outs[1][2]: (2, 12, 4, 8)
outs[0][3]: (2, 3, 2, 4)   | outs[1][3]: (2, 12, 2, 4)
outs[0][4]: (2, 3, 1, 2)   | outs[1][4]: (2, 12, 1, 2)
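The 3 and 12 channels come from the number of anchors per location: with the usual FPN RPN anchor generator (one scale times three aspect ratios = 3 anchors per location, and use_sigmoid_cls=True so one score per anchor), the cls branch outputs 3 channels and the reg branch 3 * 4 = 12. A sketch of that anchor_generator config (values assumed from the standard faster_rcnn_r50_fpn.py config; the config actually used here may differ):

# RPN anchor generator, as in the standard faster_rcnn_r50_fpn.py config (assumed)
anchor_generator = dict(
    type='AnchorGenerator',
    scales=[8],               # one scale per level
    ratios=[0.5, 1.0, 2.0],   # three aspect ratios -> 3 anchors per location
    strides=[4, 8, 16, 32, 64])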
2.2 Compute the loss
This enters the loss function in rpn_head.py:
def loss(self,
         cls_scores,
         bbox_preds,
         gt_bboxes,
         img_metas,
         gt_bboxes_ignore=None):
    """Compute losses of the head.

    Args:
        cls_scores (list[Tensor]): Box scores for each scale level
            Has shape (N, num_anchors * num_classes, H, W)
        bbox_preds (list[Tensor]): Box energies / deltas for each scale
            level with shape (N, num_anchors * 4, H, W)
        gt_bboxes (list[Tensor]): Ground truth bboxes for each image with
            shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
        img_metas (list[dict]): Meta information of each image, e.g.,
            image size, scaling factor, etc.
        gt_bboxes_ignore (None | list[Tensor]): specify which bounding
            boxes can be ignored when computing the loss.

    Returns:
        dict[str, Tensor]: A dictionary of loss components.
    """
    # call anchor_head's loss function to get the box loss and the binary
    # classification loss (gt_labels is passed as None)
    losses = super(RPNHead, self).loss(
        cls_scores,
        bbox_preds,
        gt_bboxes,
        None,
        img_metas,
        gt_bboxes_ignore=gt_bboxes_ignore)
    return dict(
        loss_rpn_cls=losses['loss_cls'], loss_rpn_bbox=losses['loss_bbox'])
This then calls the loss function in anchor_head.py:
def loss(self,
         cls_scores,
         bbox_preds,
         gt_bboxes,
         gt_labels,
         img_metas,
         gt_bboxes_ignore=None):
    """Compute losses of the head.

    ---- covers anchor generation, anchor-target generation, and the loss ----

    Args:
        cls_scores (list[Tensor]): Box scores for each scale level
            Has shape (N, num_anchors * num_classes, H, W)
        bbox_preds (list[Tensor]): Box energies / deltas for each scale
            level with shape (N, num_anchors * 4, H, W)  # predicted box deltas
        gt_bboxes (list[Tensor]): Ground truth bboxes for each image with
            shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.  # regression targets
        gt_labels (list[Tensor]): class indices corresponding to each box
        img_metas (list[dict]): Meta information of each image, e.g.,
            image size, scaling factor, etc.
        gt_bboxes_ignore (None | list[Tensor]): specify which bounding
            boxes can be ignored when computing the loss. Default: None

    Returns:
        dict[str, Tensor]: A dictionary of loss components.
    """
    featmap_sizes = [featmap.size()[-2:] for featmap in cls_scores]
    assert len(featmap_sizes) == self.prior_generator.num_levels  # for a 3-level FPN, num_levels would have to be changed to 3
    device = cls_scores[0].device
    # anchor_list holds the multi-scale anchors of every image:
    # num_imgs -> num_levels -> [num_anchors, 4]
    anchor_list, valid_flag_list = self.get_anchors(
        featmap_sizes, img_metas, device=device)
    label_channels = self.cls_out_channels if self.use_sigmoid_cls else 1
    # Generate the anchor targets: labels, weights and bbox targets.
    # cls_reg_targets is a 6-element tuple:
    #   - labels_list (list[Tensor]): Labels of each level.
    #   - label_weights_list (list[Tensor]): Label weights of each level.
    #   - bbox_targets_list (list[Tensor]): BBox targets of each level.
    #   - bbox_weights_list (list[Tensor]): BBox weights of each level.
    #   - num_total_pos (int): Number of positive samples in all images.
    #   - num_total_neg (int): Number of negative samples in all images.
    # layout: num_levels -> (num_images, num_anchors, 4)
    cls_reg_targets = self.get_targets(
        anchor_list,
        valid_flag_list,
        gt_bboxes,
        img_metas,
        gt_bboxes_ignore_list=gt_bboxes_ignore,
        gt_labels_list=gt_labels,
        label_channels=label_channels)
    if cls_reg_targets is None:
        return None
    (labels_list, label_weights_list, bbox_targets_list, bbox_weights_list,
     num_total_pos, num_total_neg) = cls_reg_targets
    num_total_samples = (
        num_total_pos + num_total_neg if self.sampling else num_total_pos)
    # anchor number of multi levels
    num_level_anchors = [anchors.size(0) for anchors in anchor_list[0]]
    # concat all level anchors and flags to a single tensor
    concat_anchor_list = []
    for i in range(len(anchor_list)):
        concat_anchor_list.append(torch.cat(anchor_list[i]))
    all_anchor_list = images_to_levels(concat_anchor_list,
                                       num_level_anchors)
    losses_cls, losses_bbox = multi_apply(
        self.loss_single,
        cls_scores,          # per-level predicted scores: num_levels -> (N, num_anchors * num_classes, H, W)
        bbox_preds,          # per-level predicted deltas: num_levels -> (N, num_anchors * 4, H, W)
        all_anchor_list,     # per-level anchors: [batch, num_anchors, 4]
        labels_list,         # per-level labels: [batch, num_anchors]; in the RPN, 0 = positive, 1 = negative (background)
        label_weights_list,
        bbox_targets_list,   # per-level bbox targets: [batch, num_anchors, 4]
        bbox_weights_list,
        num_total_samples=num_total_samples)
    return dict(loss_cls=losses_cls, loss_bbox=losses_bbox)
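For reference, loss_single (applied per level through multi_apply above) looks roughly like this in anchor_head.py (a slightly abridged sketch of the mmdet 2.x code):

def loss_single(self, cls_score, bbox_pred, anchors, labels, label_weights,
                bbox_targets, bbox_weights, num_total_samples):
    # classification loss: flatten predictions and targets, then apply loss_cls
    labels = labels.reshape(-1)
    label_weights = label_weights.reshape(-1)
    cls_score = cls_score.permute(0, 2, 3, 1).reshape(-1, self.cls_out_channels)
    loss_cls = self.loss_cls(
        cls_score, labels, label_weights, avg_factor=num_total_samples)
    # regression loss: flatten deltas and targets, then apply loss_bbox
    bbox_targets = bbox_targets.reshape(-1, 4)
    bbox_weights = bbox_weights.reshape(-1, 4)
    bbox_pred = bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4)
    if self.reg_decoded_bbox:
        # some losses (e.g. IoU losses) are computed on decoded boxes
        anchors = anchors.reshape(-1, 4)
        bbox_pred = self.bbox_coder.decode(anchors, bbox_pred)
    loss_bbox = self.loss_bbox(
        bbox_pred, bbox_targets, bbox_weights, avg_factor=num_total_samples)
    return loss_cls, loss_bbox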
In short: a set of anchors is first generated on every feature map; then every anchor is matched against every gt box (for each anchor, the gt with the highest IoU; for each gt, the anchor with the highest IoU), and the regression and classification targets are determined. In the RPN the classification target is just 0 or 1, where 0 means positive and 1 means negative. (How exactly are they matched? See the assignment step below.)
(1) First, featmap_sizes are obtained: (16, 32), (8, 16), (4, 8), (2, 4), (1, 2).
(2) Then anchor_list is generated for these featmap_sizes, split by batch first:
anchor_list[0] | [1536, 4], [384, 4], [96, 4], [24, 4], [6, 4]
anchor_list[1] | [1536, 4], [384, 4], [96, 4], [24, 4], [6, 4]
Summed over levels, each image gets 2046 anchors in total (a (2046, 4) tensor), as the sketch below verifies.
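A quick sanity check of those per-level anchor counts (each count is H * W * 3; the 3 anchors per location is an assumption based on the standard one-scale, three-ratio RPN anchor generator):

# per-level anchor count = H * W * num_anchors_per_location
featmap_sizes = [(16, 32), (8, 16), (4, 8), (2, 4), (1, 2)]
num_anchors_per_loc = 3
per_level = [h * w * num_anchors_per_loc for h, w in featmap_sizes]
print(per_level)       # [1536, 384, 96, 24, 6]
print(sum(per_level))  # 2046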
(3) Then the anchor targets are generated: each anchor's label, weight, and bbox target.
This ultimately happens in the _get_targets_single function in anchor_head.py, which processes each image separately.
The functions that assign anchors to positive/negative samples and then sample them are:
# assign a label to each anchor by matching it against the gt boxes (max IoU)
assign_result = self.assigner.assign(
    anchors, gt_bboxes, gt_bboxes_ignore,
    None if self.sampling else gt_labels)
# sampling: select a subset of positive and negative samples for training
sampling_result = self.sampler.sample(assign_result, anchors,
                                      gt_bboxes)
a. Assignment
self.assigner is mmdet.core.bbox.assigners.MaxIoUAssigner. It is configured so that anchors whose IoU with a gt box is greater than 0.7 are positives, those below 0.3 are negatives, and those in between are ignored (see the config sketch below).
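For reference, the assigner/sampler configuration typically looks like this in the standard Faster R-CNN config (values assumed from faster_rcnn_r50_fpn.py; the config actually used here may differ):

# train_cfg.rpn in the standard faster_rcnn_r50_fpn.py config (assumed)
rpn = dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.7,         # IoU > 0.7 with a gt box -> positive
        neg_iou_thr=0.3,         # IoU < 0.3 with every gt box -> negative
        min_pos_iou=0.3,
        match_low_quality=True,  # every gt keeps at least its best anchor
        ignore_iof_thr=-1),
    sampler=dict(
        type='RandomSampler',
        num=256,                 # sample 256 anchors per image for the loss
        pos_fraction=0.5,
        neg_pos_ub=-1,
        add_gt_as_proposals=False),
    allowed_border=-1,
    pos_weight=-1,
    debug=False)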
Entering the assign function in max_iou_assigner.py:
overlaps = self.iou_calculator(gt_bboxes, bboxes)
assign_result = self.assign_wrt_overlaps(overlaps, gt_labels)
gt_bboxes.shape | [1, 4]    | [2, 4]
bboxes.shape    | [2046, 4] | [2046, 4]
overlaps.shape  | [1, 2046] | [2, 2046]
(the two columns correspond to the two images in the batch: one has 1 gt box, the other 2)
That is:
num_gts, num_bboxes = overlaps.size(0), overlaps.size(1)
Then two maxima are computed:
For every anchor, which gt has the highest IoU with it:
max_overlaps, argmax_overlaps = overlaps.max(dim=0)
For every gt, which anchor has the highest IoU with it:
gt_max_overlaps, gt_argmax_overlaps = overlaps.max(dim=1)
max_overlaps has shape [2046]: for each anchor it holds the largest IoU over all gt boxes (argmax_overlaps holds the index of the corresponding gt). Anchors with a value above 0.7 become positives, those below 0.3 become negatives, and anything in between is ignored; a simplified sketch of this assignment is given below.
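The thresholding itself happens in assign_wrt_overlaps. Below is a simplified, runnable sketch of its core logic (the real implementation also handles empty gts, gt_bboxes_ignore, gt_max_assign_all, and class labels):

import torch

def assign_sketch(overlaps, pos_iou_thr=0.7, neg_iou_thr=0.3, min_pos_iou=0.3):
    """Simplified core of MaxIoUAssigner.assign_wrt_overlaps.

    overlaps: Tensor of shape (num_gts, num_bboxes).
    Returns assigned_gt_inds: -1 = ignore, 0 = negative, k > 0 = matched to gt k-1.
    """
    num_gts, num_bboxes = overlaps.size(0), overlaps.size(1)
    assigned_gt_inds = overlaps.new_full((num_bboxes,), -1, dtype=torch.long)

    # for every anchor: best gt; for every gt: best anchor
    max_overlaps, argmax_overlaps = overlaps.max(dim=0)
    gt_max_overlaps, gt_argmax_overlaps = overlaps.max(dim=1)

    # negatives: best IoU below neg_iou_thr
    assigned_gt_inds[(max_overlaps >= 0) & (max_overlaps < neg_iou_thr)] = 0
    # positives: best IoU above pos_iou_thr, matched to that gt (1-based index)
    pos_inds = max_overlaps >= pos_iou_thr
    assigned_gt_inds[pos_inds] = argmax_overlaps[pos_inds] + 1
    # low-quality matches: every gt keeps at least its best-overlapping anchor
    for i in range(num_gts):
        if gt_max_overlaps[i] >= min_pos_iou:
            assigned_gt_inds[gt_argmax_overlaps[i]] = i + 1
    return assigned_gt_inds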
2.3 Obtain the proposal list
The RPN effectively has two branches: one classifies the anchors into positive and negative (via sigmoid/softmax on the cls scores), the other predicts bounding-box regression offsets for the anchors to refine them into accurate proposals. The final proposal step then combines the positive anchors with their regression offsets to produce the proposals, while discarding proposals that are too small or out of the image boundary. By the time the proposals are produced, the network has essentially completed object localization.
proposal_list = self.get_bboxes(
    *outs, img_metas=img_metas, cfg=proposal_cfg)  # includes non-maximum suppression
The input outs here has the same shapes as in 2.1:
rpn_cls_score              | rpn_bbox_pred
outs[0][0]: (2, 3, 16, 32) | outs[1][0]: (2, 12, 16, 32)
outs[0][1]: (2, 3, 8, 16)  | outs[1][1]: (2, 12, 8, 16)
outs[0][2]: (2, 3, 4, 8)   | outs[1][2]: (2, 12, 4, 8)
outs[0][3]: (2, 3, 2, 4)   | outs[1][3]: (2, 12, 2, 4)
outs[0][4]: (2, 3, 1, 2)   | outs[1][4]: (2, 12, 1, 2)
Most of the work happens in the _get_bboxes_single function in rpn_head.py, which transforms the outputs of a single image into bbox predictions:
def _get_bboxes_single(self,
                       cls_score_list,
                       bbox_pred_list,
                       score_factor_list,
                       mlvl_anchors,
                       img_meta,
                       cfg,
                       rescale=False,
                       with_nms=True,
                       **kwargs):
    """Transform outputs of a single image into bbox predictions.

    Args:
        cls_score_list (list[Tensor]): Box scores from all scale
            levels of a single image, each item has shape
            (num_anchors * num_classes, H, W).
        bbox_pred_list (list[Tensor]): Box energies / deltas from
            all scale levels of a single image, each item has
            shape (num_anchors * 4, H, W).
        score_factor_list (list[Tensor]): Score factor from all scale
            levels of a single image. RPN head does not need this value.
        mlvl_anchors (list[Tensor]): Anchors of all scale level
            each item has shape (num_anchors, 4).
        img_meta (dict): Image meta info.
        cfg (mmcv.Config): Test / postprocessing configuration,
            if None, test_cfg would be used.
        rescale (bool): If True, return boxes in original image space.
            Default: False.
        with_nms (bool): If True, do nms before return boxes.
            Default: True.

    Returns:
        Tensor: Labeled boxes in shape (n, 5), where the first 4 columns
            are bounding box positions (tl_x, tl_y, br_x, br_y) and the
            5-th column is a score between 0 and 1.
    """
    cfg = self.test_cfg if cfg is None else cfg
    cfg = copy.deepcopy(cfg)
    img_shape = img_meta['img_shape']

    # bboxes from different level should be independent during NMS,
    # level_ids are used as labels for batched NMS to separate them
    level_ids = []
    mlvl_scores = []
    mlvl_bbox_preds = []
    mlvl_valid_anchors = []
    nms_pre = cfg.get('nms_pre', -1)
    # process each FPN level separately:
    for level_idx in range(len(cls_score_list)):
        rpn_cls_score = cls_score_list[level_idx]
        rpn_bbox_pred = bbox_pred_list[level_idx]
        assert rpn_cls_score.size()[-2:] == rpn_bbox_pred.size()[-2:]
        rpn_cls_score = rpn_cls_score.permute(1, 2, 0)
        # if use_sigmoid_cls=True, apply a sigmoid to the classification output to get scores
        if self.use_sigmoid_cls:
            rpn_cls_score = rpn_cls_score.reshape(-1)  # (16, 32, 3) -> 16*32*3 = 1536
            scores = rpn_cls_score.sigmoid()
        else:
            rpn_cls_score = rpn_cls_score.reshape(-1, 2)
            # We set FG labels to [0, num_class-1] and BG label to
            # num_class in RPN head since mmdet v2.5, which is unified to
            # be consistent with other head since mmdet v2.0. In mmdet v2.0
            # to v2.4 we keep BG label as 0 and FG label as 1 in rpn head.
            scores = rpn_cls_score.softmax(dim=1)[:, 0]
        rpn_bbox_pred = rpn_bbox_pred.permute(1, 2, 0).reshape(-1, 4)  # (12, 16, 32) -> (16, 32, 12) -> (1536, 4)
        anchors = mlvl_anchors[level_idx]
        if 0 < nms_pre < scores.shape[0]:
            # skipped on this level: nms_pre keeps at most 2000 boxes per level,
            # and this level only has 1536 (< 2000)
            # sort is faster than topk
            # _, topk_inds = scores.topk(cfg.nms_pre)
            ranked_scores, rank_inds = scores.sort(descending=True)
            topk_inds = rank_inds[:nms_pre]
            scores = ranked_scores[:nms_pre]
            rpn_bbox_pred = rpn_bbox_pred[topk_inds, :]
            anchors = anchors[topk_inds, :]
        mlvl_scores.append(scores)
        mlvl_bbox_preds.append(rpn_bbox_pred)
        mlvl_valid_anchors.append(anchors)
        level_ids.append(
            scores.new_full((scores.size(0), ),
                            level_idx,
                            dtype=torch.long))

    return self._bbox_post_process(mlvl_scores, mlvl_bbox_preds,
                                   mlvl_valid_anchors, level_ids, cfg,
                                   img_shape)
Up to this point, _get_bboxes_single has reshaped the outputs in outs (cls_score_list and bbox_pred_list). For example, for a single image:
cls_score_list[0]: (3, 16, 32) -> (16, 32, 3) -> (1536,), then a sigmoid gives scores of shape (1536,)  # 1536 = 16*32*3
bbox_pred_list[0]: (12, 16, 32) -> (16, 32, 12) -> (1536, 4)
It then enters the _bbox_post_process function, which does the NMS operation for bboxes in the same level:
def _bbox_post_process(self, mlvl_scores, mlvl_bboxes, mlvl_valid_anchors,
                       level_ids, cfg, img_shape, **kwargs):
    """bbox post-processing method.

    Do the nms operation for bboxes in same level.

    Args:
        mlvl_scores (list[Tensor]): Box scores from all scale
            levels of a single image, each item has shape
            (num_bboxes, ).
        mlvl_bboxes (list[Tensor]): Decoded bboxes from all scale
            levels of a single image, each item has shape (num_bboxes, 4).
        mlvl_valid_anchors (list[Tensor]): Anchors of all scale level
            each item has shape (num_bboxes, 4).
        level_ids (list[Tensor]): Indexes from all scale levels of a
            single image, each item has shape (num_bboxes, ).
        cfg (mmcv.Config): Test / postprocessing configuration,
            if None, `self.test_cfg` would be used.
        img_shape (tuple(int)): The shape of model's input image.

    Returns:
        Tensor: Labeled boxes in shape (n, 5), where the first 4 columns
            are bounding box positions (tl_x, tl_y, br_x, br_y) and the
            5-th column is a score between 0 and 1.
    """
The shapes of the inputs (for one image):
level | mlvl_scores | mlvl_bboxes | mlvl_valid_anchors
0     | (1536,)     | (1536, 4)   | (1536, 4)
1     | (384,)      | (384, 4)    | (384, 4)
2     | (96,)       | (96, 4)     | (96, 4)
3     | (24,)       | (24, 4)     | (24, 4)
4     | (6,)        | (6, 4)      | (6, 4)
# concatenate the scores / bbox preds / anchors from all levels
scores = torch.cat(mlvl_scores)
anchors = torch.cat(mlvl_valid_anchors)
rpn_bbox_pred = torch.cat(mlvl_bboxes)
# decode
proposals = self.bbox_coder.decode(
    anchors, rpn_bbox_pred, max_shape=img_shape)
ids = torch.cat(level_ids)

if cfg.min_bbox_size >= 0:
    w = proposals[:, 2] - proposals[:, 0]
    h = proposals[:, 3] - proposals[:, 1]
    valid_mask = (w > cfg.min_bbox_size) & (h > cfg.min_bbox_size)
    if not valid_mask.all():
        proposals = proposals[valid_mask]
        scores = scores[valid_mask]
        ids = ids[valid_mask]

if proposals.numel() > 0:
    dets, _ = batched_nms(proposals, scores, ids, cfg.nms)
else:
    return proposals.new_zeros(0, 5)

return dets[:cfg.max_per_img]
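For reference, nms_pre, min_bbox_size, cfg.nms and max_per_img all come from the RPN proposal config; in the standard Faster R-CNN config it looks like this (values assumed from faster_rcnn_r50_fpn.py; the config actually used may differ):

# train_cfg.rpn_proposal / test_cfg.rpn in the standard config (assumed)
rpn_proposal = dict(
    nms_pre=2000,                             # keep at most 2000 boxes per level before NMS
    max_per_img=1000,                         # keep at most 1000 proposals after NMS
    nms=dict(type='nms', iou_threshold=0.7),
    min_bbox_size=0)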
The decode function applies the transformation `pred_bboxes` to `boxes`. Inside decode:
decoded_bboxes = delta2bbox(bboxes, pred_bboxes, self.means,
                            self.stds, max_shape, wh_ratio_clip,
                            self.clip_border, self.add_ctr_clamp,
                            self.ctr_clamp)
# here bboxes are the anchors and pred_bboxes is rpn_bbox_pred, i.e. the
# bbox deltas predicted from the feature maps by the conv layers
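delta2bbox converts each anchor plus its predicted (dx, dy, dw, dh) deltas into an actual box. A minimal single-box sketch of the transform (ignoring the means/stds de-normalisation, wh_ratio_clip and center clamping that the real implementation applies):

import math

def delta2bbox_sketch(anchor, delta, max_shape=None):
    """Single-box version of the delta -> box transform (simplified)."""
    x1, y1, x2, y2 = anchor                       # anchor corners
    dx, dy, dw, dh = delta                        # predicted deltas
    px, py = (x1 + x2) * 0.5, (y1 + y2) * 0.5     # anchor center
    pw, ph = x2 - x1, y2 - y1                     # anchor size
    gx, gy = px + pw * dx, py + ph * dy           # shift the center
    gw, gh = pw * math.exp(dw), ph * math.exp(dh) # scale the size
    bbox = [gx - gw * 0.5, gy - gh * 0.5, gx + gw * 0.5, gy + gh * 0.5]
    if max_shape is not None:                     # clip to the input image
        h, w = max_shape[:2]
        bbox = [min(max(bbox[0], 0), w), min(max(bbox[1], 0), h),
                min(max(bbox[2], 0), w), min(max(bbox[3], 0), h)]
    return bbox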