Faster RCNN: Principles and Code Walkthrough

00000cj

Category: Object Detection

Article link: https://blog.csdn.net/ooooocj/article/details/109479745

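This article walks through the Faster RCNN pipeline based on this keras implementation: https://github.com/dishen12/keras_frcnn (the original author deleted the implementation; this is someone else's fork). The tensorflow implementation at https://github.com/endernewton/tf-faster-rcnn is also fairly clear (though in my opinion not as simple as the keras version) and can be read alongside it.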

Data processing

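The backbone is vgg16, whose output feature_map has stride=16 relative to the network input. Each point of the feature_map corresponds to 9 anchors (3 ratios * 3 scales), so if the feature_map has width w and height h there are w*h*9 anchors in total. The function calc_rpn computes the iou between the anchors and the gt boxes to decide which anchors are positive classification samples (foreground that contains an object, regardless of its specific class), which are negative samples, which are ignored, and which take part in bounding box regression.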
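
To make the anchor count concrete, here is a minimal sketch that enumerates the 9 anchor shapes and counts the anchors for a given feature_map size. The scale and ratio values are the ones shown in the comments of calc_rpn; anchor_shapes and count_anchors are hypothetical helper names used only for illustration, and 56 x 38 is the example feature_map size used later in this article.

```python
import math

# Values taken from the comments in calc_rpn.
anchor_box_scales = [128, 256, 512]
anchor_box_ratios = [[1, 1],
                     [1. / math.sqrt(2), 2. / math.sqrt(2)],
                     [2. / math.sqrt(2), 1. / math.sqrt(2)]]


def anchor_shapes(scales, ratios):
    """Return the (width, height) of every scale/ratio combination."""
    return [(s * r[0], s * r[1]) for s in scales for r in ratios]


def count_anchors(feat_w, feat_h, scales, ratios):
    """Total anchors before any are discarded for crossing the image border."""
    return feat_w * feat_h * len(scales) * len(ratios)


shapes = anchor_shapes(anchor_box_scales, anchor_box_ratios)
for w, h in shapes:
    print(round(w, 1), round(h, 1))
print(count_anchors(56, 38, anchor_box_scales, anchor_box_ratios))  # 56 * 38 * 9
```

Many of these anchors are later skipped because they cross the image boundary, so the number that actually receives a label is smaller.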

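import random

import numpy as np

# iou(a, b) is a helper defined alongside calc_rpn in the original module (not shown here)


def calc_rpn(C, img_data, width, height, resized_width, resized_height, img_length_calc_function):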
    downscale = float(C.rpn_stride)  # 16
    anchor_sizes = C.anchor_box_scales  # [128, 256, 512]
    anchor_ratios = C.anchor_box_ratios  # [[1, 1], [1./math.sqrt(2), 2./math.sqrt(2)], [2./math.sqrt(2), 1./math.sqrt(2)]]
    num_anchors = len(anchor_sizes) * len(anchor_ratios)  # 9

    # calculate the output map size based on the network architecture
    (output_width, output_height) = img_length_calc_function(resized_width, resized_height)  # //16

    n_anchratios = len(anchor_ratios)  # 3

    # initialize empty output objectives
    y_rpn_overlap = np.zeros((output_height, output_width, num_anchors))
    y_is_box_valid = np.zeros((output_height, output_width, num_anchors))
    y_rpn_regr = np.zeros((output_height, output_width, num_anchors * 4))

    num_bboxes = len(img_data['bboxes'])  # e.g. 2

    num_anchors_for_bbox = np.zeros(num_bboxes).astype(int)
    best_anchor_for_bbox = -1 * np.ones((num_bboxes, 4)).astype(int)
    best_iou_for_bbox = np.zeros(num_bboxes).astype(np.float32)
    best_x_for_bbox = np.zeros((num_bboxes, 4)).astype(int)
    best_dx_for_bbox = np.zeros((num_bboxes, 4)).astype(np.float32)

    # get the GT box coordinates, and resize to account for image resizing
    gta = np.zeros((num_bboxes, 4))
    for bbox_num, bbox in enumerate(img_data['bboxes']):
        # get the GT box coordinates, and resize to account for image resizing
        gta[bbox_num, 0] = bbox['x1'] * (resized_width / float(width))
        gta[bbox_num, 1] = bbox['x2'] * (resized_width / float(width))
        gta[bbox_num, 2] = bbox['y1'] * (resized_height / float(height))
        gta[bbox_num, 3] = bbox['y2'] * (resized_height / float(height))

    # rpn ground truth
    for anchor_size_idx in range(len(anchor_sizes)):  # 3
        for anchor_ratio_idx in range(n_anchratios):  # 3
            anchor_x = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][0]
            anchor_y = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][1]

            for ix in range(output_width):
                # x-coordinates of the current anchor box
                x1_anc = downscale * (ix + 0.5) - anchor_x / 2
                x2_anc = downscale * (ix + 0.5) + anchor_x / 2

                # ignore boxes that go across image boundaries
                if x1_anc < 0 or x2_anc > resized_width:
                    continue

                for jy in range(output_height):
                    # y-coordinates of the current anchor box
                    y1_anc = downscale * (jy + 0.5) - anchor_y / 2
                    y2_anc = downscale * (jy + 0.5) + anchor_y / 2

                    # ignore boxes that go across image boundaries
                    if y1_anc < 0 or y2_anc > resized_height:
                        continue

                    # bbox_type indicates whether an anchor should be a target
                    bbox_type = 'neg'

                    # this is the best IOU for the (x,y) coord and the current anchor
                    # note that this is different from the best IOU for a GT bbox
                    best_iou_for_loc = 0.0  # one of two

                    for bbox_num in range(num_bboxes):
                        # get IOU of the current GT box and the current anchor box
                        curr_iou = iou([gta[bbox_num, 0], gta[bbox_num, 2], gta[bbox_num, 1], gta[bbox_num, 3]],
                                       [x1_anc, y1_anc, x2_anc, y2_anc])
                        # calculate the regression targets if they will be needed
                        if curr_iou > best_iou_for_bbox[bbox_num] or curr_iou > C.rpn_max_overlap:
                            cx = (gta[bbox_num, 0] + gta[bbox_num, 1]) / 2.0
                            cy = (gta[bbox_num, 2] + gta[bbox_num, 3]) / 2.0
                            cxa = (x1_anc + x2_anc) / 2.0
                            cya = (y1_anc + y2_anc) / 2.0

                            tx = (cx - cxa) / (x2_anc - x1_anc)
                            ty = (cy - cya) / (y2_anc - y1_anc)
                            tw = np.log((gta[bbox_num, 1] - gta[bbox_num, 0]) / (x2_anc - x1_anc))
                            th = np.log((gta[bbox_num, 3] - gta[bbox_num, 2]) / (y2_anc - y1_anc))

                        if img_data['bboxes'][bbox_num]['class'] != 'bg':
                            # all GT boxes should be mapped to an anchor box,
                            # so we keep track of which anchor box was best
                            if curr_iou > best_iou_for_bbox[bbox_num]:
                                best_anchor_for_bbox[bbox_num] = [jy, ix, anchor_ratio_idx, anchor_size_idx]  # enough to recover the anchor's coordinates
                                best_iou_for_bbox[bbox_num] = curr_iou
                                best_x_for_bbox[bbox_num, :] = [x1_anc, x2_anc, y1_anc, y2_anc]  # anchor coordinates; seems to be unused?
                                best_dx_for_bbox[bbox_num, :] = [tx, ty, tw, th]

                            # we set the anchor to positive if the IOU is > 0.7
                            # (it does not matter if there was another better box, it just indicates overlap)
                            if curr_iou > C.rpn_max_overlap:
                                bbox_type = 'pos'
                                num_anchors_for_bbox[bbox_num] += 1
                                # we update the regression layer target if this IOU
                                # is the best for the current (x,y) and anchor position
                                if curr_iou > best_iou_for_loc:
                                    best_iou_for_loc = curr_iou
                                    best_regr = (tx, ty, tw, th)

                            # if the IOU is > 0.3 and < 0.7, it is ambiguous and not included in the objective
                            if C.rpn_min_overlap < curr_iou < C.rpn_max_overlap:
                                # gray zone between neg and pos
                                if bbox_type != 'pos':
                                    bbox_type = 'neutral'

                    # turn on or off outputs depending on IOUs
                    if bbox_type == 'neg':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'neutral':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'pos':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1

                        start = 4 * (anchor_ratio_idx + n_anchratios * anchor_size_idx)
                        y_rpn_regr[jy, ix, start:start + 4] = best_regr

    # we ensure that every bbox has at least one positive RPN region
    for idx in range(num_anchors_for_bbox.shape[0]):
        if num_anchors_for_bbox[idx] == 0:
            # no box with an IOU greater than zero ...
            if best_anchor_for_bbox[idx, 0] == -1:
                continue
            y_is_box_valid[best_anchor_for_bbox[idx, 0], best_anchor_for_bbox[idx, 1],
                           best_anchor_for_bbox[idx, 2] + n_anchratios * best_anchor_for_bbox[idx, 3]] = 1
            y_rpn_overlap[best_anchor_for_bbox[idx, 0], best_anchor_for_bbox[idx, 1],
                          best_anchor_for_bbox[idx, 2] + n_anchratios * best_anchor_for_bbox[idx, 3]] = 1
            start = 4 * (best_anchor_for_bbox[idx, 2] + n_anchratios * best_anchor_for_bbox[idx, 3])
            y_rpn_regr[best_anchor_for_bbox[idx, 0], best_anchor_for_bbox[idx, 1], start:start + 4] \
                = best_dx_for_bbox[idx, :]

    y_rpn_overlap = np.transpose(y_rpn_overlap, (2, 0, 1))
    y_rpn_overlap = np.expand_dims(y_rpn_overlap, axis=0)

    y_is_box_valid = np.transpose(y_is_box_valid, (2, 0, 1))
    y_is_box_valid = np.expand_dims(y_is_box_valid, axis=0)

    y_rpn_regr = np.transpose(y_rpn_regr, (2, 0, 1))
    y_rpn_regr = np.expand_dims(y_rpn_regr, axis=0)

    pos_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 1, y_is_box_valid[0, :, :, :] == 1))
    neg_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 0, y_is_box_valid[0, :, :, :] == 1))

    num_pos = len(pos_locs[0])

    # one issue is that the RPN has many more negative than positive regions, so we turn off some of the negative
    # regions. We also limit it to 256 regions.
    num_regions = 256

    if len(pos_locs[0]) > num_regions // 2:
        # integer division: random.sample needs an int sample size in Python 3
        val_locs = random.sample(range(len(pos_locs[0])), len(pos_locs[0]) - num_regions // 2)
        y_is_box_valid[0, pos_locs[0][val_locs], pos_locs[1][val_locs], pos_locs[2][val_locs]] = 0
        num_pos = num_regions // 2

    if len(neg_locs[0]) + num_pos > num_regions:
        val_locs = random.sample(range(len(neg_locs[0])), len(neg_locs[0]) - num_pos)
        y_is_box_valid[0, neg_locs[0][val_locs], neg_locs[1][val_locs], neg_locs[2][val_locs]] = 0

    y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
    y_rpn_regr = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

    return np.copy(y_rpn_cls), np.copy(y_rpn_regr)
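
calc_rpn calls an iou helper that is not shown above. As a rough sketch (written here for illustration, not copied from the repository), it could look like the following, assuming both boxes are given as [x1, y1, x2, y2], which matches how calc_rpn reorders the gt coordinates before the call:

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    # Degenerate boxes have no area, so treat their overlap as zero.
    if a[0] >= a[2] or a[1] >= a[3] or b[0] >= b[2] or b[1] >= b[3]:
        return 0.0

    # Width and height of the overlapping rectangle (non-positive if disjoint).
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0

    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)


print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # overlap 25, union 175
```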

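The code visits every point of the output feature_map, multiplies its position by the stride of 16 to map it back to the input image and obtain the anchor center, and considers each of the 9 anchors of different sizes and aspect ratios at that point. Each anchor's iou with every gt box is computed, and the anchor's label is decided by comparing the iou with preset thresholds. The rules are:

Anchors that extend beyond the image boundary are ignored.
Classification: anchors with iou>0.7 are positive samples, those with 0.3<iou<0.7 are ignored, and those with iou<0.3 are negative samples.
Regression: only anchors with iou>0.7 take part in regression. If an anchor has iou>0.7 with several gt boxes, the gt box with the largest iou is used to compute the regression target. If a gt box has iou below 0.7 with every anchor, the anchor with the largest iou for that gt box is marked positive and used for regression (unless its iou with every anchor is 0).
Because negative samples far outnumber positive samples, the paper limits the total number of samples to 256. If there are more than 128 positives, they are capped at 128 and 128 negatives are kept, with the rest ignored. If there are fewer than 128 positives, all of them are kept and 256 minus the number of positives negatives are kept, with the rest ignored. (The calc_rpn code above differs in the second case: it subsamples the negatives down to num_pos, so positives and negatives end up equal in number.)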
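
The sampling rule in the last item can be sketched as follows. This is a minimal illustration of the rule as stated in the text rather than the repository code; sample_rpn_labels and its arguments are hypothetical names, and anchors are represented simply by integer indices.

```python
import random


def sample_rpn_labels(pos_idx, neg_idx, num_regions=256, seed=None):
    """Pick which positive/negative anchors stay in the RPN loss.

    pos_idx / neg_idx are lists of anchor indices (hypothetical inputs).
    Returns (kept_pos, kept_neg); every other anchor is treated as ignored.
    """
    rng = random.Random(seed)

    # At most half of the budget goes to positives.
    max_pos = num_regions // 2
    kept_pos = list(pos_idx)
    if len(kept_pos) > max_pos:
        kept_pos = rng.sample(kept_pos, max_pos)

    # Negatives fill whatever is left of the budget.
    max_neg = num_regions - len(kept_pos)
    kept_neg = list(neg_idx)
    if len(kept_neg) > max_neg:
        kept_neg = rng.sample(kept_neg, max_neg)

    return kept_pos, kept_neg


pos, neg = sample_rpn_labels(range(10), range(10, 5000), seed=0)
print(len(pos), len(neg))  # 10 positives, 246 negatives
```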

RPN

Suppose the network input has shape (1,900,600,3), i.e. (batch_size, w, h, channel); the feature_map output by the vgg16 backbone then has shape (1,56,38,512).

def rpn(base_layers, num_anchors):
    x = Conv2D(512, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)

    x_class = Conv2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    x_regr = Conv2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)

    return [x_class, x_regr, base_layers]

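In the code, base_layers is the feature_map output by the backbone and num_anchors=9. The RPN output x_class has shape (1,56,38,9) and x_regr has shape (1,56,38,36), i.e. a class score and box regression values for every anchor. The role of the RPN is to pick, out of the large number of anchors (56*38*9=19152), a small number that are likely to contain an object, i.e. the proposals. Note that classification here only separates foreground from background, not specific classes; up to this point the network is the first stage. During training, two things happen after the RPN. First, the loss against the output of calc_rpn is computed, with cross-entropy loss for classification and smooth L1 loss for regression. Second, proposals are selected from all anchors: the regression values output by the RPN are applied to the anchors to obtain the predicted box coordinates, the results are clipped so that the boxes stay inside the image, and boxes with invalid regression results are removed (for example, a top-left x coordinate greater than the bottom-right one). NMS is then run on the classification scores to pick the 300 candidate boxes most likely to contain an object (the result may contain fewer than 300).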

R = roi_helpers.rpn_to_roi(P_rpn[0], P_rpn[1], C, K.image_dim_ordering(), use_regr=True, overlap_thresh=0.7, max_boxes=300)  # _proposal_layer
# first apply the regression values in P_rpn[1] on the feature map, then run nms according to the scores in P_rpn[0]
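
The proposal selection steps (apply the regression, clip, drop invalid boxes, NMS) can be sketched in plain Python as below. This is an illustrative simplification under stated assumptions, not the body of rpn_to_roi: the function names are hypothetical, boxes are [x1, y1, x2, y2] lists in a single coordinate system, and the decoding simply inverts the (tx, ty, tw, th) encoding used in calc_rpn.

```python
import math


def decode_box(anchor, deltas):
    """Apply (tx, ty, tw, th) to an anchor [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = anchor
    tx, ty, tw, th = deltas
    wa, ha = x2 - x1, y2 - y1
    cx = x1 + wa / 2.0 + tx * wa          # shifted center
    cy = y1 + ha / 2.0 + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)  # rescaled size
    return [cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0]


def clip_box(box, width, height):
    """Clamp a box to the rectangle [0, width] x [0, height]."""
    x1, y1, x2, y2 = box
    return [max(0.0, x1), max(0.0, y1), min(float(width), x2), min(float(height), y2)]


def box_iou(a, b):
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms(boxes, scores, overlap_thresh=0.7, max_boxes=300):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= overlap_thresh for j in keep):
            keep.append(i)
            if len(keep) == max_boxes:
                break
    return keep


def proposals_from_rpn(anchors, deltas, scores, width, height):
    boxes, kept_scores = [], []
    for a, d, s in zip(anchors, deltas, scores):
        b = clip_box(decode_box(a, d), width, height)
        if b[2] - b[0] > 0 and b[3] - b[1] > 0:  # drop degenerate boxes
            boxes.append(b)
            kept_scores.append(s)
    return [boxes[i] for i in nms(boxes, kept_scores)]
```

The defaults overlap_thresh=0.7 and max_boxes=300 mirror the arguments passed to rpn_to_roi in the call above.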

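The 300 proposals are then compared with the gt boxes by iou. Those with iou<0.1, i.e. easily classified background, are dropped; those with 0.1<iou<0.5 are labeled as background; those with iou>0.5 are labeled with the specific class (unlike the RPN GT labels, which only mark positive and negative samples), and the offsets to the gt box are computed once more. The class labels and gt box offsets produced here are the classification and regression targets for the final output of the second-stage network.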

X2, Y1, Y2, IouS = roi_helpers.calc_iou(R, img_data, C, class_mapping)  # proposal_target_layer
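# drops easily classified background, i.e. proposals whose IOU with the GT is below 0.1, and computes the regression offsets against the GT once more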
# (1,245,4), (1,245,21), (1,245,160),
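
The labeling rule for proposals can be sketched as follows. This is a simplified illustration and not the body of calc_iou: label_proposals and its arguments are hypothetical names, the thresholds 0.1 and 0.5 are the ones quoted in the text, and the regression offsets are left out.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / float(union)


def label_proposals(proposals, gt_boxes, gt_classes, min_overlap=0.1, max_overlap=0.5):
    """Return (box, label) pairs; proposals below min_overlap are dropped."""
    labelled = []
    for box in proposals:
        # Find the gt box that overlaps this proposal the most.
        best_iou, best_cls = 0.0, None
        for gt, cls in zip(gt_boxes, gt_classes):
            cur = box_iou(box, gt)
            if cur > best_iou:
                best_iou, best_cls = cur, cls

        if best_iou < min_overlap:
            continue  # easy background, not used for training
        label = 'bg' if best_iou < max_overlap else best_cls
        labelled.append((box, label))
    return labelled


result = label_proposals(
    [[0, 0, 10, 10], [0, 0, 10, 30], [100, 100, 110, 110]],
    gt_boxes=[[0, 0, 10, 10]],
    gt_classes=['cat'],
)
print(result)
```

In this toy input the first proposal matches the gt box exactly, the second overlaps it with iou 1/3 and becomes background, and the third does not overlap at all and is dropped.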

ROI Pooling

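This is where the second stage begins. Before ROI Pooling, the network once more limits the number of rois (proposals) that take part in the computation. The code sets this to 32 and, as in the first stage, if there are more than 16 positive samples 16 are picked at random, if there are fewer than 16 all are kept, and negative samples are picked at random to fill 32 minus the number of positives. The final classification and regression use fully connected layers, which need a fixed-size input, so the role of roi pooling is to bring the selected rois of different sizes to one fixed size. Concretely, the regions of the output feature_map that correspond to the 32 rois are taken, each region is divided evenly into pool_w*pool_h blocks (pool_w=pool_h=7 in the paper), and max_pooling is applied to each block. Rois of different sizes therefore all become 7*7, and fully connected layers then produce the final class prediction and box regression prediction.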
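
The pooling step described above can be sketched for a single channel in plain Python. This is a minimal illustration of max pooling over an evenly divided roi and not the repository's pooling layer; roi_max_pool is a hypothetical name, the feature map is a 2-D list indexed as [row][col], and the roi is given as (x, y, w, h) in feature-map cells with w and h at least 1.

```python
def roi_max_pool(feature, roi, pool_size=7):
    """Max-pool one roi of a single-channel feature map to pool_size x pool_size."""
    x, y, w, h = roi
    out = []
    for py in range(pool_size):
        # Integer bin edges; each bin is forced to span at least one cell.
        y1 = y + (py * h) // pool_size
        y2 = max(y1 + 1, y + ((py + 1) * h) // pool_size)
        row = []
        for px in range(pool_size):
            x1 = x + (px * w) // pool_size
            x2 = max(x1 + 1, x + ((px + 1) * w) // pool_size)
            row.append(max(feature[r][c]
                           for r in range(y1, y2)
                           for c in range(x1, x2)))
        out.append(row)
    return out


# A 4x4 map holding the values 0..15, pooled to 2x2 over the whole map.
feature = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature, (0, 0, 4, 4), pool_size=2))  # [[5, 7], [13, 15]]
```

Whatever the roi size, the output always has pool_size rows and pool_size columns, which is what lets the fully connected layers accept it.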