1. extract_feat
Extracts multi-scale feature maps from the input images.
The returned x is a tuple of 5 tensors, one feature map per scale; their shapes are:
x[0] | (2, 256, 16, 32)
x[1] | (2, 256, 8, 16)
x[2] | (2, 256, 4, 8)
x[3] | (2, 256, 2, 4)
x[4] | (2, 256, 1, 2)
Note: the first dimension is the batch_size; the second is the number of channels, which is set in the config; the third and fourth are H and W. H and W are not fixed, because the input image size is not fixed.
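For reference, extract_feat in two_stage.py is essentially just the backbone followed by the FPN neck (a minimal sketch of the mmdet 2.x implementation):

def extract_feat(self, img):
    """Directly extract features from the backbone + neck."""
    x = self.backbone(img)   # e.g. ResNet: multi-stage feature maps
    if self.with_neck:
        x = self.neck(x)     # FPN: 5 levels, each with 256 channels here
    return x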
2. RPN forward and loss
2.1 Compute the classification and regression outputs of the multi-scale feature maps
This happens in the forward_train function in base_dense_head.py:
def forward_train(self,
                  x,
                  img_metas,
                  gt_bboxes,
                  gt_labels=None,
                  gt_bboxes_ignore=None,
                  proposal_cfg=None,
                  **kwargs):
    """
    Args:
        x (list[Tensor]): Features from FPN.
        img_metas (list[dict]): Meta information of each image, e.g.,
            image size, scaling factor, etc.
        gt_bboxes (Tensor): Ground truth bboxes of the image,
            shape (num_gts, 4).
        gt_labels (Tensor): Ground truth labels of each box,
            shape (num_gts,).
        gt_bboxes_ignore (Tensor): Ground truth bboxes to be ignored,
            shape (num_ignored_gts, 4).
        proposal_cfg (mmcv.Config): Test / postprocessing configuration,
            if None, test_cfg would be used.

    Returns:
        tuple:
            losses (dict[str, Tensor]): A dictionary of loss components.
            proposal_list (list[Tensor]): Proposals of each image.
    """
    outs = self(x)
    # the RPN only does binary (object vs. background) classification,
    # so gt_labels is None here
    if gt_labels is None:
        loss_inputs = outs + (gt_bboxes, img_metas)
    else:
        loss_inputs = outs + (gt_bboxes, gt_labels, img_metas)
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
    if proposal_cfg is None:
        return losses
    else:
        proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
        return losses, proposal_list
Here self(x) calls base_dense_head's forward function, which calls anchor_head's forward; rpn_head in turn overrides forward_single. So ultimately, rpn_head's forward_single passes each FPN feature map through a shared convolution and two branches (classification and regression) to produce the outputs.
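For reference, forward_single in rpn_head.py (called once per FPN level) looks roughly like this in mmdet 2.x, where rpn_conv is the shared 3x3 convolution and rpn_cls / rpn_reg are the two 1x1 prediction branches:

def forward_single(self, x):
    """Forward a feature map of a single scale level."""
    x = self.rpn_conv(x)               # shared 3x3 conv
    x = F.relu(x, inplace=True)
    rpn_cls_score = self.rpn_cls(x)    # 1x1 conv -> num_anchors * cls_out_channels
    rpn_bbox_pred = self.rpn_reg(x)    # 1x1 conv -> num_anchors * 4
    return rpn_cls_score, rpn_bbox_pred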
outs contains 2 elements: the first is rpn_cls_score, the second is rpn_bbox_pred. Each of them is a list of 5 tensors, corresponding to the 5 feature-map scales in x.
Their shapes are:
rpn_cls_score              | rpn_bbox_pred
outs[0][0]: (2, 3, 16, 32) | outs[1][0]: (2, 12, 16, 32)
outs[0][1]: (2, 3, 8, 16)  | outs[1][1]: (2, 12, 8, 16)
outs[0][2]: (2, 3, 4, 8)   | outs[1][2]: (2, 12, 4, 8)
outs[0][3]: (2, 3, 2, 4)   | outs[1][3]: (2, 12, 2, 4)
outs[0][4]: (2, 3, 1, 2)   | outs[1][4]: (2, 12, 1, 2)
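The 3 and 12 channels come from the number of anchors per location: with the usual FPN RPN anchor generator (one scale times three aspect ratios = 3 anchors per location, and use_sigmoid_cls=True so one score per anchor), the cls branch outputs 3 channels and the reg branch 3 * 4 = 12. A sketch of that anchor_generator config (values assumed from the standard faster_rcnn_r50_fpn.py config; the config actually used here may differ):

# RPN anchor generator, as in the standard faster_rcnn_r50_fpn.py config (assumed)
anchor_generator = dict(
    type='AnchorGenerator',
    scales=[8],               # one scale per level
    ratios=[0.5, 1.0, 2.0],   # three aspect ratios -> 3 anchors per location
    strides=[4, 8, 16, 32, 64])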
2.2 Compute the loss
This enters the loss function in rpn_head.py:
def loss(self,
         cls_scores,
         bbox_preds,
         gt_bboxes,
         img_metas,
         gt_bboxes_ignore=None):
    """Compute losses of the head.

    Args:
        cls_scores (list[Tensor]): Box scores for each scale level
            Has shape (N, num_anchors * num_classes, H, W)
        bbox_preds (list[Tensor]): Box energies / deltas for each scale
            level with shape (N, num_anchors * 4, H, W)
        gt_bboxes (list[Tensor]): Ground truth bboxes for each image with
            shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
        img_metas (list[dict]): Meta information of each image, e.g.,
            image size, scaling factor, etc.
        gt_bboxes_ignore (None | list[Tensor]): specify which bounding
            boxes can be ignored when computing the loss.

    Returns:
        dict[str, Tensor]: A dictionary of loss components.
    """
    # call anchor_head's loss function to get the box loss and the binary
    # classification loss (gt_labels is passed as None)
    losses = super(RPNHead, self).loss(
        cls_scores,
        bbox_preds,
        gt_bboxes,
        None,
        img_metas,
        gt_bboxes_ignore=gt_bboxes_ignore)
    return dict(
        loss_rpn_cls=losses['loss_cls'], loss_rpn_bbox=losses['loss_bbox'])
This then calls the loss function in anchor_head.py:
def loss(self,
         cls_scores,
         bbox_preds,
         gt_bboxes,
         gt_labels,
         img_metas,
         gt_bboxes_ignore=None):
    """Compute losses of the head.

    ---- covers anchor generation, anchor-target generation, and the loss ----

    Args:
        cls_scores (list[Tensor]): Box scores for each scale level
            Has shape (N, num_anchors * num_classes, H, W)
        bbox_preds (list[Tensor]): Box energies / deltas for each scale
            level with shape (N, num_anchors * 4, H, W)  # predicted box deltas
        gt_bboxes (list[Tensor]): Ground truth bboxes for each image with
            shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.  # regression targets
        gt_labels (list[Tensor]): class indices corresponding to each box
        img_metas (list[dict]): Meta information of each image, e.g.,
            image size, scaling factor, etc.
        gt_bboxes_ignore (None | list[Tensor]): specify which bounding
            boxes can be ignored when computing the loss. Default: None

    Returns:
        dict[str, Tensor]: A dictionary of loss components.
    """
    featmap_sizes = [featmap.size()[-2:] for featmap in cls_scores]
    assert len(featmap_sizes) == self.prior_generator.num_levels  # for a 3-level FPN, num_levels would have to be changed to 3
    device = cls_scores[0].device
    # anchor_list holds the multi-scale anchors of every image:
    # num_imgs -> num_levels -> [num_anchors, 4]
    anchor_list, valid_flag_list = self.get_anchors(
        featmap_sizes, img_metas, device=device)
    label_channels = self.cls_out_channels if self.use_sigmoid_cls else 1
    # Generate the anchor targets: labels, weights and bbox targets.
    # cls_reg_targets is a 6-element tuple:
    #   - labels_list (list[Tensor]): Labels of each level.
    #   - label_weights_list (list[Tensor]): Label weights of each level.
    #   - bbox_targets_list (list[Tensor]): BBox targets of each level.
    #   - bbox_weights_list (list[Tensor]): BBox weights of each level.
    #   - num_total_pos (int): Number of positive samples in all images.
    #   - num_total_neg (int): Number of negative samples in all images.
    # layout: num_levels -> (num_images, num_anchors, 4)
    cls_reg_targets = self.get_targets(
        anchor_list,
        valid_flag_list,
        gt_bboxes,
        img_metas,
        gt_bboxes_ignore_list=gt_bboxes_ignore,
        gt_labels_list=gt_labels,
        label_channels=label_channels)
    if cls_reg_targets is None:
        return None
    (labels_list, label_weights_list, bbox_targets_list, bbox_weights_list,
     num_total_pos, num_total_neg) = cls_reg_targets
    num_total_samples = (
        num_total_pos + num_total_neg if self.sampling else num_total_pos)
    # anchor number of multi levels
    num_level_anchors = [anchors.size(0) for anchors in anchor_list[0]]
    # concat all level anchors and flags to a single tensor
    concat_anchor_list = []
    for i in range(len(anchor_list)):
        concat_anchor_list.append(torch.cat(anchor_list[i]))
    all_anchor_list = images_to_levels(concat_anchor_list,
                                       num_level_anchors)
    losses_cls, losses_bbox = multi_apply(
        self.loss_single,
        cls_scores,          # per-level predicted scores: num_levels -> (N, num_anchors * num_classes, H, W)
        bbox_preds,          # per-level predicted deltas: num_levels -> (N, num_anchors * 4, H, W)
        all_anchor_list,     # per-level anchors: [batch, num_anchors, 4]
        labels_list,         # per-level labels: [batch, num_anchors]; in the RPN, 0 = positive, 1 = negative (background)
        label_weights_list,
        bbox_targets_list,   # per-level bbox targets: [batch, num_anchors, 4]
        bbox_weights_list,
        num_total_samples=num_total_samples)
    return dict(loss_cls=losses_cls, loss_bbox=losses_bbox)
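For reference, loss_single (applied per level through multi_apply above) looks roughly like this in anchor_head.py (a slightly abridged sketch of the mmdet 2.x code):

def loss_single(self, cls_score, bbox_pred, anchors, labels, label_weights,
                bbox_targets, bbox_weights, num_total_samples):
    # classification loss: flatten predictions and targets, then apply loss_cls
    labels = labels.reshape(-1)
    label_weights = label_weights.reshape(-1)
    cls_score = cls_score.permute(0, 2, 3, 1).reshape(-1, self.cls_out_channels)
    loss_cls = self.loss_cls(
        cls_score, labels, label_weights, avg_factor=num_total_samples)
    # regression loss: flatten deltas and targets, then apply loss_bbox
    bbox_targets = bbox_targets.reshape(-1, 4)
    bbox_weights = bbox_weights.reshape(-1, 4)
    bbox_pred = bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4)
    if self.reg_decoded_bbox:
        # some losses (e.g. IoU losses) are computed on decoded boxes
        anchors = anchors.reshape(-1, 4)
        bbox_pred = self.bbox_coder.decode(anchors, bbox_pred)
    loss_bbox = self.loss_bbox(
        bbox_pred, bbox_targets, bbox_weights, avg_factor=num_total_samples)
    return loss_cls, loss_bbox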
In short: a set of anchors is first generated on every feature map; then every anchor is matched against every gt box (for each anchor, the gt with the highest IoU; for each gt, the anchor with the highest IoU), and the regression and classification targets are determined. In the RPN the classification target is just 0 or 1, where 0 means positive and 1 means negative. (How exactly are they matched? See the assignment step below.)
(1) First, featmap_sizes are obtained: (16, 32), (8, 16), (4, 8), (2, 4), (1, 2).
(2) Then anchor_list is generated for these featmap_sizes, split by batch first:
anchor_list[0] | [1536, 4], [384, 4], [96, 4], [24, 4], [6, 4]
anchor_list[1] | [1536, 4], [384, 4], [96, 4], [24, 4], [6, 4]
Summed over levels, each image gets 2046 anchors in total (a (2046, 4) tensor), as the sketch below verifies.
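A quick sanity check of those per-level anchor counts (each count is H * W * 3; the 3 anchors per location is an assumption based on the standard one-scale, three-ratio RPN anchor generator):

# per-level anchor count = H * W * num_anchors_per_location
featmap_sizes = [(16, 32), (8, 16), (4, 8), (2, 4), (1, 2)]
num_anchors_per_loc = 3
per_level = [h * w * num_anchors_per_loc for h, w in featmap_sizes]
print(per_level)       # [1536, 384, 96, 24, 6]
print(sum(per_level))  # 2046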
(3) Then the anchor targets are generated: each anchor's label, weight, and bbox target.
This ultimately happens in the _get_targets_single function in anchor_head.py, which processes each image separately.
The functions that assign anchors to positive/negative samples and then sample them are:
# assign a label to each anchor by matching it against the gt boxes (max IoU)
assign_result = self.assigner.assign(
    anchors, gt_bboxes, gt_bboxes_ignore,
    None if self.sampling else gt_labels)
# sampling: select a subset of positive and negative samples for training
sampling_result = self.sampler.sample(assign_result, anchors,
                                      gt_bboxes)
a. Assignment
self.assigner is mmdet.core.bbox.assigners.MaxIoUAssigner. It is configured so that anchors whose IoU with a gt box is greater than 0.7 are positives, those below 0.3 are negatives, and those in between are ignored (see the config sketch below).
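For reference, the assigner/sampler configuration typically looks like this in the standard Faster R-CNN config (values assumed from faster_rcnn_r50_fpn.py; the config actually used here may differ):

# train_cfg.rpn in the standard faster_rcnn_r50_fpn.py config (assumed)
rpn = dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.7,         # IoU > 0.7 with a gt box -> positive
        neg_iou_thr=0.3,         # IoU < 0.3 with every gt box -> negative
        min_pos_iou=0.3,
        match_low_quality=True,  # every gt keeps at least its best anchor
        ignore_iof_thr=-1),
    sampler=dict(
        type='RandomSampler',
        num=256,                 # sample 256 anchors per image for the loss
        pos_fraction=0.5,
        neg_pos_ub=-1,
        add_gt_as_proposals=False),
    allowed_border=-1,
    pos_weight=-1,
    debug=False)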
Entering the assign function in max_iou_assigner.py:
overlaps = self.iou_calculator(gt_bboxes, bboxes)
assign_result = self.assign_wrt_overlaps(overlaps, gt_labels)
gt_bboxes.shape | [1, 4]    | [2, 4]
bboxes.shape    | [2046, 4] | [2046, 4]
overlaps.shape  | [1, 2046] | [2, 2046]
(the two columns correspond to the two images in the batch: one has 1 gt box, the other 2)
That is:
num_gts, num_bboxes = overlaps.size(0), overlaps.size(1)
Then two maxima are computed:
For every anchor, which gt has the highest IoU with it:
max_overlaps, argmax_overlaps = overlaps.max(dim=0)
For every gt, which anchor has the highest IoU with it:
gt_max_overlaps, gt_argmax_overlaps = overlaps.max(dim=1)
max_overlaps has shape [2046]: for each anchor it holds the largest IoU over all gt boxes (argmax_overlaps holds the index of the corresponding gt). Anchors with a value above 0.7 become positives, those below 0.3 become negatives, and anything in between is ignored; a simplified sketch of this assignment is given below.
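The thresholding itself happens in assign_wrt_overlaps. Below is a simplified, runnable sketch of its core logic (the real implementation also handles empty gts, gt_bboxes_ignore, gt_max_assign_all, and class labels):

import torch

def assign_sketch(overlaps, pos_iou_thr=0.7, neg_iou_thr=0.3, min_pos_iou=0.3):
    """Simplified core of MaxIoUAssigner.assign_wrt_overlaps.

    overlaps: Tensor of shape (num_gts, num_bboxes).
    Returns assigned_gt_inds: -1 = ignore, 0 = negative, k > 0 = matched to gt k-1.
    """
    num_gts, num_bboxes = overlaps.size(0), overlaps.size(1)
    assigned_gt_inds = overlaps.new_full((num_bboxes,), -1, dtype=torch.long)

    # for every anchor: best gt; for every gt: best anchor
    max_overlaps, argmax_overlaps = overlaps.max(dim=0)
    gt_max_overlaps, gt_argmax_overlaps = overlaps.max(dim=1)

    # negatives: best IoU below neg_iou_thr
    assigned_gt_inds[(max_overlaps >= 0) & (max_overlaps < neg_iou_thr)] = 0
    # positives: best IoU above pos_iou_thr, matched to that gt (1-based index)
    pos_inds = max_overlaps >= pos_iou_thr
    assigned_gt_inds[pos_inds] = argmax_overlaps[pos_inds] + 1
    # low-quality matches: every gt keeps at least its best-overlapping anchor
    for i in range(num_gts):
        if gt_max_overlaps[i] >= min_pos_iou:
            assigned_gt_inds[gt_argmax_overlaps[i]] = i + 1
    return assigned_gt_inds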
2.3 Obtain the proposal list
The RPN effectively has two branches: one classifies the anchors into positive and negative (via sigmoid/softmax on the cls scores), the other predicts bounding-box regression offsets for the anchors to refine them into accurate proposals. The final proposal step then combines the positive anchors with their regression offsets to produce the proposals, while discarding proposals that are too small or out of the image boundary. By the time the proposals are produced, the network has essentially completed object localization.
proposal_list = self.get_bboxes(
    *outs, img_metas=img_metas, cfg=proposal_cfg)  # includes non-maximum suppression
The input outs here has the same shapes as in 2.1:
rpn_cls_score              | rpn_bbox_pred
outs[0][0]: (2, 3, 16, 32) | outs[1][0]: (2, 12, 16, 32)
outs[0][1]: (2, 3, 8, 16)  | outs[1][1]: (2, 12, 8, 16)
outs[0][2]: (2, 3, 4, 8)   | outs[1][2]: (2, 12, 4, 8)
outs[0][3]: (2, 3, 2, 4)   | outs[1][3]: (2, 12, 2, 4)
outs[0][4]: (2, 3, 1, 2)   | outs[1][4]: (2, 12, 1, 2)
Most of the work happens in the _get_bboxes_single function in rpn_head.py, which transforms the outputs of a single image into bbox predictions:
def _get_bboxes_single(self,
                       cls_score_list,
                       bbox_pred_list,
                       score_factor_list,
                       mlvl_anchors,
                       img_meta,
                       cfg,
                       rescale=False,
                       with_nms=True,
                       **kwargs):
    """Transform outputs of a single image into bbox predictions.

    Args:
        cls_score_list (list[Tensor]): Box scores from all scale
            levels of a single image, each item has shape
            (num_anchors * num_classes, H, W).
        bbox_pred_list (list[Tensor]): Box energies / deltas from
            all scale levels of a single image, each item has
            shape (num_anchors * 4, H, W).
        score_factor_list (list[Tensor]): Score factor from all scale
            levels of a single image. RPN head does not need this value.
        mlvl_anchors (list[Tensor]): Anchors of all scale level
            each item has shape (num_anchors, 4).
        img_meta (dict): Image meta info.
        cfg (mmcv.Config): Test / postprocessing configuration,
            if None, test_cfg would be used.
        rescale (bool): If True, return boxes in original image space.
            Default: False.
        with_nms (bool): If True, do nms before return boxes.
            Default: True.

    Returns:
        Tensor: Labeled boxes in shape (n, 5), where the first 4 columns
            are bounding box positions (tl_x, tl_y, br_x, br_y) and the
            5-th column is a score between 0 and 1.
    """
    cfg = self.test_cfg if cfg is None else cfg
    cfg = copy.deepcopy(cfg)
    img_shape = img_meta['img_shape']

    # bboxes from different level should be independent during NMS,
    # level_ids are used as labels for batched NMS to separate them
    level_ids = []
    mlvl_scores = []
    mlvl_bbox_preds = []
    mlvl_valid_anchors = []
    nms_pre = cfg.get('nms_pre', -1)
    # process each FPN level separately:
    for level_idx in range(len(cls_score_list)):
        rpn_cls_score = cls_score_list[level_idx]
        rpn_bbox_pred = bbox_pred_list[level_idx]
        assert rpn_cls_score.size()[-2:] == rpn_bbox_pred.size()[-2:]
        rpn_cls_score = rpn_cls_score.permute(1, 2, 0)
        # if use_sigmoid_cls=True, apply a sigmoid to the classification output to get scores
        if self.use_sigmoid_cls:
            rpn_cls_score = rpn_cls_score.reshape(-1)  # (16, 32, 3) -> 16*32*3 = 1536
            scores = rpn_cls_score.sigmoid()
        else:
            rpn_cls_score = rpn_cls_score.reshape(-1, 2)
            # We set FG labels to [0, num_class-1] and BG label to
            # num_class in RPN head since mmdet v2.5, which is unified to
            # be consistent with other head since mmdet v2.0. In mmdet v2.0
            # to v2.4 we keep BG label as 0 and FG label as 1 in rpn head.
            scores = rpn_cls_score.softmax(dim=1)[:, 0]
        rpn_bbox_pred = rpn_bbox_pred.permute(1, 2, 0).reshape(-1, 4)  # (12, 16, 32) -> (16, 32, 12) -> (1536, 4)
        anchors = mlvl_anchors[level_idx]
        if 0 < nms_pre < scores.shape[0]:
            # skipped on this level: nms_pre keeps at most 2000 boxes per level,
            # and this level only has 1536 (< 2000)
            # sort is faster than topk
            # _, topk_inds = scores.topk(cfg.nms_pre)
            ranked_scores, rank_inds = scores.sort(descending=True)
            topk_inds = rank_inds[:nms_pre]
            scores = ranked_scores[:nms_pre]
            rpn_bbox_pred = rpn_bbox_pred[topk_inds, :]
            anchors = anchors[topk_inds, :]
        mlvl_scores.append(scores)
        mlvl_bbox_preds.append(rpn_bbox_pred)
        mlvl_valid_anchors.append(anchors)
        level_ids.append(
            scores.new_full((scores.size(0), ),
                            level_idx,
                            dtype=torch.long))

    return self._bbox_post_process(mlvl_scores, mlvl_bbox_preds,
                                   mlvl_valid_anchors, level_ids, cfg,
                                   img_shape)
Up to this point, _get_bboxes_single has reshaped the outputs in outs (cls_score_list and bbox_pred_list). For example, for a single image:
cls_score_list[0]: (3, 16, 32) -> (16, 32, 3) -> (1536,), then a sigmoid gives scores of shape (1536,)  # 1536 = 16*32*3
bbox_pred_list[0]: (12, 16, 32) -> (16, 32, 12) -> (1536, 4)
It then enters the _bbox_post_process function, which does the NMS operation for bboxes in the same level:
def _bbox_post_process(self, mlvl_scores, mlvl_bboxes, mlvl_valid_anchors,
                       level_ids, cfg, img_shape, **kwargs):
    """bbox post-processing method.

    Do the nms operation for bboxes in same level.

    Args:
        mlvl_scores (list[Tensor]): Box scores from all scale
            levels of a single image, each item has shape
            (num_bboxes, ).
        mlvl_bboxes (list[Tensor]): Decoded bboxes from all scale
            levels of a single image, each item has shape (num_bboxes, 4).
        mlvl_valid_anchors (list[Tensor]): Anchors of all scale level
            each item has shape (num_bboxes, 4).
        level_ids (list[Tensor]): Indexes from all scale levels of a
            single image, each item has shape (num_bboxes, ).
        cfg (mmcv.Config): Test / postprocessing configuration,
            if None, `self.test_cfg` would be used.
        img_shape (tuple(int)): The shape of model's input image.

    Returns:
        Tensor: Labeled boxes in shape (n, 5), where the first 4 columns
            are bounding box positions (tl_x, tl_y, br_x, br_y) and the
            5-th column is a score between 0 and 1.
    """
The shapes of the inputs (for one image):
level | mlvl_scores | mlvl_bboxes | mlvl_valid_anchors
0     | (1536,)     | (1536, 4)   | (1536, 4)
1     | (384,)      | (384, 4)    | (384, 4)
2     | (96,)       | (96, 4)     | (96, 4)
3     | (24,)       | (24, 4)     | (24, 4)
4     | (6,)        | (6, 4)      | (6, 4)
# concatenate the scores / bbox preds / anchors from all levels
scores = torch.cat(mlvl_scores)
anchors = torch.cat(mlvl_valid_anchors)
rpn_bbox_pred = torch.cat(mlvl_bboxes)
# decode
proposals = self.bbox_coder.decode(
    anchors, rpn_bbox_pred, max_shape=img_shape)
ids = torch.cat(level_ids)

if cfg.min_bbox_size >= 0:
    w = proposals[:, 2] - proposals[:, 0]
    h = proposals[:, 3] - proposals[:, 1]
    valid_mask = (w > cfg.min_bbox_size) & (h > cfg.min_bbox_size)
    if not valid_mask.all():
        proposals = proposals[valid_mask]
        scores = scores[valid_mask]
        ids = ids[valid_mask]

if proposals.numel() > 0:
    dets, _ = batched_nms(proposals, scores, ids, cfg.nms)
else:
    return proposals.new_zeros(0, 5)

return dets[:cfg.max_per_img]
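For reference, nms_pre, min_bbox_size, cfg.nms and max_per_img all come from the RPN proposal config; in the standard Faster R-CNN config it looks like this (values assumed from faster_rcnn_r50_fpn.py; the config actually used may differ):

# train_cfg.rpn_proposal / test_cfg.rpn in the standard config (assumed)
rpn_proposal = dict(
    nms_pre=2000,                             # keep at most 2000 boxes per level before NMS
    max_per_img=1000,                         # keep at most 1000 proposals after NMS
    nms=dict(type='nms', iou_threshold=0.7),
    min_bbox_size=0)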
The decode function applies the transformation `pred_bboxes` to `boxes`. Inside decode:
decoded_bboxes = delta2bbox(bboxes, pred_bboxes, self.means,
                            self.stds, max_shape, wh_ratio_clip,
                            self.clip_border, self.add_ctr_clamp,
                            self.ctr_clamp)
# here bboxes are the anchors and pred_bboxes is rpn_bbox_pred, i.e. the
# bbox deltas predicted from the feature maps by the conv layers
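delta2bbox converts each anchor plus its predicted (dx, dy, dw, dh) deltas into an actual box. A minimal single-box sketch of the transform (ignoring the means/stds de-normalisation, wh_ratio_clip and center clamping that the real implementation applies):

import math

def delta2bbox_sketch(anchor, delta, max_shape=None):
    """Single-box version of the delta -> box transform (simplified)."""
    x1, y1, x2, y2 = anchor                       # anchor corners
    dx, dy, dw, dh = delta                        # predicted deltas
    px, py = (x1 + x2) * 0.5, (y1 + y2) * 0.5     # anchor center
    pw, ph = x2 - x1, y2 - y1                     # anchor size
    gx, gy = px + pw * dx, py + ph * dy           # shift the center
    gw, gh = pw * math.exp(dw), ph * math.exp(dh) # scale the size
    bbox = [gx - gw * 0.5, gy - gh * 0.5, gx + gw * 0.5, gy + gh * 0.5]
    if max_shape is not None:                     # clip to the input image
        h, w = max_shape[:2]
        bbox = [min(max(bbox[0], 0), w), min(max(bbox[1], 0), h),
                min(max(bbox[2], 0), w), min(max(bbox[3], 0), h)]
    return bbox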