Mask R-CNN: A Very Detailed Code Walkthrough (Part 2)

Cleo_Gao

Category: Mask R-CNN code walkthrough. Tags: python, deep learning, tensorflow, computer vision

Article link: https://blog.csdn.net/cleo_gao/article/details/116133308

Previous post: Mask R-CNN: A Very Detailed Code Walkthrough (Part 1)

Contents

1 Recap of the network structure covered in Part 1
2 Training code walkthrough, continued
- 2.1 ROIAlign Layer
- 2.2 Detection Target Layer
3 Indexing used in the code
- Example 1
- Example 2

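1 Recap of the network structure covered in Part 1

Part 1 covered three components: the ResNet Graph, the Region Proposal Network (RPN), and the Proposal Layer. (The MaskRCNN class ties them all together.)

The ResNet Graph is a stack of convolutions whose job is feature extraction. An input image first goes through the ResNet Graph, which produces [C1, C2, C3, C4, C5]; every later stage of the network builds on these features.

As the later walkthrough of the MaskRCNN class shows, the C-series features are processed into the P-series features [P2, P3, P4, P5] (C1 is not used), and P6 is then obtained from P5 by max pooling. [P2, P3, P4, P5, P6] are fed as feature maps into the Region Proposal Network (RPN), which outputs rpn_class_logits, rpn_class, and rpn_bbox.

Feeding these outputs into the Proposal Layer yields the proposals.
Figure: how the three parts covered in Part 1 fit together

That is how the three parts from Part 1 relate to one another (see the figure above). Next we continue with the ROIAlign Layer, the Detection Target Layer, and the Feature Pyramid Network Heads.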

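2 Training code walkthrough, continued

2.1 ROIAlign Layer

ROIAlign is the hardest part to understand. In the code it consists of two pieces:

The function def log2_graph(x): TensorFlow surprisingly has no built-in op for $\log_2 x$, so the code defines its own, which simply returns tf.log(x) / tf.log(2.0).
The class PyramidROIAlign(KE.Layer): like the earlier layers it inherits from KE.Layer, so that data processed with TensorFlow ops can continue to flow through Keras. This was explained in the previous post and is not repeated here.

The rest of this section walks through the PyramidROIAlign class.

First, the __init__ method: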

    def __init__(self, pool_shape, **kwargs):
        super(PyramidROIAlign, self).__init__(**kwargs)
        self.pool_shape = tuple(pool_shape)

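As the source shows, instantiating PyramidROIAlign requires a pool_shape argument. This parameter determines the shape of the features that the ROIAlign layer outputs. Typically pool_shape=(7, 7), which means that whatever the size of the input features, the output is always 7x7 (ignoring the channel dimension).

This matters a great deal, because Mask R-CNN is designed to accept images of arbitrary size. For a convolution layer, the number of parameters = kernel height x kernel width x input channels x number of kernels (output channels), plus the biases. The kernel size and the number of kernels are hyperparameters you choose; the input image size does not affect the parameter count, only the size of the output feature map. So a convolution can be computed on an image of any size without error.

A dense layer is different: its parameter count depends on the size of its input. For classification the network ends in dense layers, so the features fed into them must have a fixed size. But the images fed into Mask R-CNN do not have a fixed size, so what can be done?

This is exactly what PyramidROIAlign is for: whatever the size of the features entering the layer, they come out at a fixed size (pool_shape, typically 7x7). The core of it is a call to tf.image.crop_and_resize. Note that it does not resize the feature map of the whole image to 7x7; that would be a resize with no crop. Instead, PyramidROIAlign uses the bbox coordinates of each object (ROI), together with the size of the object relative to the whole image, to cut that object's features out of the feature map at the appropriate pyramid level. The code below makes this concrete.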
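
To make the convolution-versus-dense argument concrete, here is a small arithmetic sketch. The layer sizes (256 channels, 1024 dense units) are illustrative assumptions and are not taken from the model code.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a 2D convolution layer."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# A 3x3 convolution with 256 input and 256 output channels:
# the count does not involve the feature map height or width at all.
conv_count = conv_params(3, 3, 256, 256)

# A dense layer with 1024 units fed by a flattened [h, w, 256] feature map:
# the count changes as soon as h or w changes.
dense_7x7 = dense_params(7 * 7 * 256, 1024)
dense_14x14 = dense_params(14 * 14 * 256, 1024)

print(conv_count)   # 590080
print(dense_7x7)    # 12846080
print(dense_14x14)  # 51381248
```

Doubling the spatial size roughly quadruples the dense layer's weights, which is why the features must be brought to a fixed size before they reach it.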

Now let's look at how PyramidROIAlign does this, in its call(self, inputs) method.

Inputs:

boxes: [batch, num_boxes, (y1, x1, y2, x2)], in normalized coordinates
image_meta: [batch, (meta data)], holds details about the original image, as described in Part 1
feature_maps: the feature pyramid; each level has shape [batch, height, width, channels]

Output:

Pooled features of fixed size: [batch, num_boxes, pool_height, pool_width, channels]

Code flow:

(1) Initialization: get boxes, image_meta, and feature_maps from inputs:

    def call(self, inputs):
        # num_boxes is the number of proposals
        # Loop over the feature levels, find the proposals assigned to each, and apply ROIAlign
        # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords
        boxes = inputs[0]
        print('boxes:',boxes)

        # Image meta
        # Holds details about the image. See compose_image_meta()
        image_meta = inputs[1]

        # Feature Maps. List of feature maps from different level of the
        # feature pyramid. Each is [batch, height, width, channels]
        feature_maps = inputs[2:]

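Where:

boxes: shape = [batch, num_boxes, (y1, x1, y2, x2)]; the coordinates are normalized.
image_meta: holds various details about the image, such as the original image size and the image id (although only image_shape is used here). It is built by compose_image_meta, and its fields can be read back with parse_image_meta(meta); both functions were covered in Part 1.
feature_maps: the pyramid features built from the ResNet Graph outputs; each has shape [batch, height, width, channels].

How are these arguments passed in? Like this, where feature_maps is a list of tensors appended to the input list (which is why call reads it back with inputs[2:]):

layer = PyramidROIAlign([7, 7])([boxes, image_meta] + feature_maps)

(2) Using the image area carried in image_meta, work out which feature map each ROI should be pooled from.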

    def call(self, inputs):
        # (1) Initialization: get boxes, image_meta, feature_maps from inputs
        ...  # initialization code omitted
        # Assign each ROI to a level in the pyramid based on the ROI area.
        # boxes holds the ROI boxes; use them to compute each ROI's height and width
        y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
        h = y2 - y1  # h.shape=[batch,num_boxes,1]
        w = x2 - x1
        # Use shape of first image. Images in a batch must have the same size.
        # Get the image size, from which the image area is computed
        image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
        # Equation 1 in the Feature Pyramid Networks paper. Account for
        # the fact that our coordinates are normalized here.
        # e.g. a 224x224 ROI (in pixels) maps to P4
        # Image area
        image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
        # Two steps to find the pyramid level each ROI should be pooled from
        roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))  # h, w are normalized
        roi_level = tf.minimum(5, tf.maximum(
            2, 4 + tf.cast(tf.round(roi_level), tf.int32)))  # clamp to the range 2-5
        roi_level = tf.squeeze(roi_level, 2)  # roi_level.shape=[batch,num_boxes] after the squeeze

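A few words of explanation:

Why compute roi_level?

roi_level (call it k) is computed as $k = k_0 + \log_2\left(\frac{\sqrt{wh}}{224}\right)$, where w and h are the width and height of the object's bounding box in pixels, so w*h is the object's area. 224 is the input size used for ImageNet pre-training. With $k_0 = 4$, an object with $\sqrt{wh} = 224$ gets k = 4, so its features are cropped from P4 in the feature pyramid. Because h and w are normalized in the code, 224 is divided by $\sqrt{\text{image\_area}}$ there to put both quantities on the same scale.

If the object covers a large part of the image, it is cropped from a deeper feature map (one that has been through more convolutions, such as P5). If the object is small, for example $k_0 = 4$ and $\sqrt{wh} = 112$, then k = 3, and the small object is cropped from a shallower feature map (P3). This helps detect objects at different scales.

The level assigned to each ROI is stored in roi_level, which after the final tf.squeeze has shape [batch, num_boxes].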
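
The level assignment can be tried out without TensorFlow. The following is a plain-Python sketch of the same arithmetic as the code above (change-of-base log, round, add 4, clamp to 2-5); pixel_box is a hypothetical helper introduced only for this illustration, and the 1024x1024 image size is an assumption.

```python
import math


def log2(x):
    # Same change-of-base trick as log2_graph in the source.
    return math.log(x) / math.log(2.0)


def roi_level(box, image_height, image_width):
    """Pyramid level (2..5) for one normalized box (y1, x1, y2, x2)."""
    y1, x1, y2, x2 = box
    h = y2 - y1
    w = x2 - x1
    image_area = float(image_height * image_width)
    k = log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))
    return min(5, max(2, 4 + int(round(k))))


def pixel_box(side, image_size=1024):
    # Hypothetical helper: a square box of `side` pixels at the top-left corner, normalized.
    return (0.0, 0.0, side / image_size, side / image_size)


for side in (32, 112, 224, 448, 896):
    print(side, roi_level(pixel_box(side), 1024, 1024))
```

A 224-pixel box lands on level 4, a 112-pixel box on level 3, and a 448-pixel box on level 5. The 32-pixel and 896-pixel boxes would compute to levels 1 and 6, and the clamp pulls them back to 2 and 5.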

(3) Loop over feature_maps, apply tf.image.crop_and_resize on each level to get the pooled features, and collect them in a list:

    def call(self, inputs):
        # (1) Initialization: get boxes, image_meta, feature_maps from inputs
        ...  # initialization code omitted
        # (2) Use the image area from image_meta to find the pyramid level each ROI is pooled from
        ...  # code omitted

        # Loop through levels and apply ROI pooling to each. P2 to P5.
        # Use the four pyramid feature maps P2-P5, which fuse features from different depths
        pooled = []
        box_to_level = []  # box_to_level[i, 0] is the index of the image the feature belongs to; box_to_level[i, 1] is its box index
        for i, level in enumerate(range(2, 6)):  # only the four feature maps P2-P5 are used
            # Find the ROIs assigned to this level
            # tf.where returns [coord1, coord2, ...]
            # np.where returns [[coord1.x, coord2.x, ...], [coord1.y, coord2.y, ...]]
            # Each row is (n, i): proposal i of image n (n indexes the batch axis, i the num_boxes axis)
            ix = tf.where(tf.equal(roi_level, level))  # ix is a set of coordinates; roi_level.shape=[batch,num_boxes], so each has two numbers
            # level_boxes holds the coordinates of every box assigned to this level
            # box_indices holds, for each of those boxes, the index of its image within the batch
            level_boxes = tf.gather_nd(boxes, ix)  # [number of proposals at this level, 4]

            # Box indices for crop_and_resize.
            box_indices = tf.cast(ix[:, 0], tf.int32)  # image index of each proposal
            # ^ ix[:, 0] is taken because tf.image.crop_and_resize expects it

            # Keep track of which box is mapped to which level
            box_to_level.append(ix)

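            # Stop gradient propagation to ROI proposals
            # level_boxes and box_indices are results computed by the RPN,
            # but the tensor obtained by applying them to the features is the input to the RCNN heads.
            # Gradients must not flow between the two parts, so tf.stop_gradient() cuts them off.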
            level_boxes = tf.stop_gradient(level_boxes)
            box_indices = tf.stop_gradient(box_indices)

            # Crop and Resize
            # From Mask R-CNN paper: "We sample four regular locations, so
            # that we can evaluate either max or average pooling. In fact,
            # interpolating only a single value at each bin center (without
            # pooling) is nearly as effective."
            #
            # Here we use the simplified approach of a single value per bin,
            # which is how it's done in tf.crop_and_resize()
            # Result: [batch * num_boxes, pool_height, pool_width, channels]
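            # Bilinear interpolation through the API
            # Arguments of tf.image.crop_and_resize:
            #   - image: the feature map
            #   - boxes: the regions to crop, as normalized [ymin, xmin, ymax, xmax]
            #   - box_ind: 1-D tensor of shape [num_boxes] linking boxes to images; box_ind[i] is the image that box i refers to
            #   - crop_size: the output size after RoIAlign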
            pooled.append(tf.image.crop_and_resize(
                feature_maps[i], level_boxes, box_indices, self.pool_shape,
                method="bilinear"))

        # Shapes of the crop_and_resize arguments:
        # [batch, image_height, image_width, channels]
        # [this_level_num_boxes, 4]
        # [this_level_num_boxes]
        # [pool_height, pool_width]

        # Pack pooled features into one tensor
        # Each box was pooled from its assigned level only; concatenate the results of all levels into one big tensor
        pooled = tf.concat(pooled, axis=0)

        # Pack box_to_level mapping into one array and add another
        # column representing the order of pooled boxes
        box_to_level = tf.concat(box_to_level, axis=0)
        box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)
        box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range],
                                 axis=1)

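A note on the key function tf.image.crop_and_resize: it first cuts out the region given by [ymin, xmin, ymax, xmax] from the image (here, the feature map), then resizes that region to the size you ask for, for example:
Figure: illustration of tf.image.crop_and_resize
The indexing code (the part involving ix) is hard to follow; see Example 1 in Section 3 of this post. Strictly speaking you can follow the overall Mask R-CNN code without it, but understanding it will help when you write your own indexing code.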
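
The crop-then-resize behaviour can be imitated on a tiny grid. The following is a simplified plain-Python sketch, not TensorFlow's implementation: it handles one single-channel image and one box, assumes the sampling convention in which a normalized coordinate y maps to y * (height - 1), and does not handle sample points that fall outside the image.

```python
def crop_and_resize(image, box, crop_size):
    """Bilinear crop of a 2D grid `image` (list of rows) to crop_size = (out_h, out_w).

    box is normalized (y1, x1, y2, x2); a coordinate y maps to y * (height - 1).
    Sample points falling outside the image are not handled in this sketch.
    """
    height, width = len(image), len(image[0])
    y1, x1, y2, x2 = box
    out_h, out_w = crop_size

    def sample_positions(start, end, size, n):
        # n evenly spaced positions from start to end (inclusive), in pixel units.
        if n == 1:
            return [0.5 * (start + end) * (size - 1)]
        step = (end - start) * (size - 1) / (n - 1)
        return [start * (size - 1) + i * step for i in range(n)]

    def bilinear(y, x):
        # Interpolate between the four grid points surrounding (y, x).
        y_lo, x_lo = int(y), int(x)
        y_hi, x_hi = min(y_lo + 1, height - 1), min(x_lo + 1, width - 1)
        dy, dx = y - y_lo, x - x_lo
        top = image[y_lo][x_lo] * (1 - dx) + image[y_lo][x_hi] * dx
        bottom = image[y_hi][x_lo] * (1 - dx) + image[y_hi][x_hi] * dx
        return top * (1 - dy) + bottom * dy

    ys = sample_positions(y1, y2, height, out_h)
    xs = sample_positions(x1, x2, width, out_w)
    return [[bilinear(y, x) for x in xs] for y in ys]


# A 4x4 "feature map" whose value at (row, col) is 4 * row + col.
feature_map = [[4.0 * r + c for c in range(4)] for r in range(4)]

whole = crop_and_resize(feature_map, (0.0, 0.0, 1.0, 1.0), (2, 2))
corner = crop_and_resize(feature_map, (0.0, 0.0, 0.5, 0.5), (2, 2))
print(whole)   # [[0.0, 3.0], [12.0, 15.0]]
print(corner)  # [[0.0, 1.5], [6.0, 7.5]]
```

Whatever box is passed in, the result always has the requested 2x2 size, which is the property PyramidROIAlign relies on.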

(4) Restore the original order and reshape to get an output of shape [batch, num_boxes, pool_height, pool_width, channels]:

    def call(self, inputs):
        ...  # code for (1), (2), (3) omitted

        # At this point pooled holds all ROIAlign output features and box_to_level holds
        # the bookkeeping for them. Because they were extracted level by level, the features
        # are not in the original order (by batch, then by box index).
        # Below we restore that order, so that each feature lines up with its image and proposal.
        # Rearrange pooled features to match the order of the original boxes
        # Sort box_to_level by batch then box index
        # TF doesn't have a way to sort by two columns, so merge them and sort.
        # box_to_level[i, 0] is the image index of the feature; box_to_level[i, 1] is its box index
        sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
        ix = tf.nn.top_k(sorting_tensor, k=tf.shape(
            box_to_level)[0]).indices[::-1]
        ix = tf.gather(box_to_level[:, 2], ix)
        pooled = tf.gather(pooled, ix)

        # Re-add the batch dimension
        shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0)
        pooled = tf.reshape(pooled, shape)
        return pooled
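
The reordering step is easier to follow on made-up data. The sketch below mirrors the merged sort key from the code above in plain Python; the string "features" are placeholders for illustration, and Python's sorted stands in for the reversed tf.nn.top_k.

```python
def restore_order(box_to_level, pooled):
    """Reorder pooled features to follow (batch index, box index).

    box_to_level rows are (batch_ix, box_ix, position_in_pooled).
    """
    # Merge the two sort columns into one key, as the source does.
    keys = [batch_ix * 100000 + box_ix for batch_ix, box_ix, _ in box_to_level]
    # Ascending order of the merged key (the source gets this from top_k, reversed).
    order = sorted(range(len(keys)), key=lambda row: keys[row])
    # Look up where each feature currently sits in pooled, then gather.
    positions = [box_to_level[row][2] for row in order]
    return [pooled[p] for p in positions]


# One image, five boxes with levels [4, 3, 3, 2, 5]. Pooling goes level by level
# (2, 3, 4, 5), so the features come out in box order 3, 1, 2, 0, 4.
pooled = ["feat_box3", "feat_box1", "feat_box2", "feat_box0", "feat_box4"]
box_to_level = [(0, 3, 0), (0, 1, 1), (0, 2, 2), (0, 0, 3), (0, 4, 4)]

restored = restore_order(box_to_level, pooled)
print(restored)
```

The merged key only sorts correctly while every image has fewer than 100000 boxes, which is the implicit assumption behind the constant in the source.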

2.2 Detection Target Layer

Inputs of the Detection Target Layer (gt stands for ground truth):

proposals: [POST_NMS_ROIS_TRAINING, (y1, x1, y2, x2)] in normalized coordinates; zero padded to the fixed size if the image yields fewer proposals
gt_class_ids: [MAX_GT_INSTANCES] int class IDs
gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates
gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.

Outputs:

rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs, zero padded to the fixed size if there are too few.
deltas: [TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))]
masks: [TRAIN_ROIS_PER_IMAGE, height, width]. Masks cropped to the corresponding bbox and resized to the network output size.

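There are three parts:

The overlaps_graph(boxes1, boxes2) function: computes the overlap between two sets of boxes, i.e. the IoU values. This code is simple and is skipped here.
The detection_targets_graph function: the main detection-target pipeline
The DetectionTargetLayer class

Note that this layer has no trainable parameters (there are no convolutions, so trainable parameters = 0).
Its purpose is to take the proposal coordinates and the annotated ground truth and compute the ROI coordinates, the offsets (deltas) between the proposals and their ground-truth boxes, and the masks.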
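
Since the IoU computation is skipped above, here is a minimal plain-Python sketch of what an overlaps matrix contains. It is an illustration of the IoU definition, not the source's overlaps_graph, and it assumes boxes with non-zero area.

```python
def box_area(box):
    y1, x1, y2, x2 = box
    return (y2 - y1) * (x2 - x1)


def iou(box1, box2):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    y1 = max(box1[0], box2[0])
    x1 = max(box1[1], box2[1])
    y2 = min(box1[2], box2[2])
    x2 = min(box1[3], box2[3])
    # Clamp at zero so that disjoint boxes give an empty intersection.
    intersection = max(y2 - y1, 0.0) * max(x2 - x1, 0.0)
    union = box_area(box1) + box_area(box2) - intersection
    return intersection / union


def overlaps(boxes1, boxes2):
    """IoU matrix of shape [len(boxes1), len(boxes2)]."""
    return [[iou(b1, b2) for b2 in boxes2] for b1 in boxes1]


proposals = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75), (0.6, 0.6, 1.0, 1.0)]
gt_boxes = [(0.0, 0.0, 0.5, 0.5)]
matrix = overlaps(proposals, gt_boxes)
print(matrix)
```

Each row belongs to one proposal and each column to one ground-truth box, which is the layout that detection_targets_graph reduces over with axis=1.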

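The code below is rather convoluted. It is worth reading the indexing discussion in Section 3 of this post first and getting comfortable with tf.where, tf.gather, and tf.gather_nd (this blog post is a useful reference).

Taking the delta computation as an example, the diagram below sketches the flow of the computation alongside the code. The other computations follow the same pattern.
Figure: flow of the delta computation
Step-by-step flow of detection_targets_graph:
(1) Remove zero padding: strip the zero entries from proposals, gt_boxes, gt_class_ids, and gt_masks (gt is short for ground truth).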

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    """Generates detection targets for one image. Subsamples proposals and
    generates target class IDs, bounding box deltas, and masks for each.

    Inputs:
    proposals: [POST_NMS_ROIS_TRAINING, (y1, x1, y2, x2)] in normalized coordinates. Might
               be zero padded if there are not enough proposals.
    gt_class_ids: [MAX_GT_INSTANCES] int class IDs
    gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates.
    gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.

    Returns: Target ROIs and corresponding class IDs, bounding box shifts,
    and masks.
    rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
    class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
    deltas: [TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))]
    masks: [TRAIN_ROIS_PER_IMAGE, height, width]. Masks cropped to bbox
           boundaries and resized to neural network output size.

    Note: Returned arrays might be zero padded if not enough target ROIs.
    """
    # Assertions
    asserts = [
        tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
                  name="roi_assertion"),
    ]
    with tf.control_dependencies(asserts):
        proposals = tf.identity(proposals)

    # Remove zero padding
    proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
    gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
    gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
                                   name="trim_gt_class_ids")
    gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
                         name="trim_gt_masks")

(2) Handle crowds (a crowd refers to a bounding box around several instances): use tf.where to get crowd_ix, then tf.gather to get crowd_boxes, and use non_crowd_ix to select the non-crowd gt_class_ids, gt_boxes, and gt_masks.

(The code tells crowd from non-crowd as follows: gt_class_id < 0 is a crowd; gt_class_id > 0 is not.)

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    ...  # code omitted
    
    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    # crowd_ix holds the positions where gt_class_id < 0
    crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
    non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
    crowd_boxes = tf.gather(gt_boxes, crowd_ix)
    gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
    gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
    gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)

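Usage of tf.gather: tf.gather(params, indices, axis=0) takes slices from params along axis according to the values in indices. Here indices comes from tf.where. The function also shows up in the indexing examples below, which may help in understanding it.

Usage notes and examples for tf.where: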

tf.where(condition, x=None, y=None, name=None)
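# condition, x and y have the same shape; condition holds boolean values
# With only condition given, it returns the indices of the True elements:
condition1 = [[True,False,False],
              [False,True,True]]
# tf.where(condition1) evaluates to:
# [[0 0]
#  [1 1]
#  [1 2]]
# If x and y are given, positions where condition is True take the value from x
# and the rest take the value from y. For example: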
import tensorflow as tf
x = [[1,2,3],[4,5,6]]
y = [[7,8,9],[10,11,12]]
condition3 = [[True,False,False],
             [False,True,True]]
condition4 = [[True,False,False],
             [True,True,False]]
with tf.Session() as sess:
    print(sess.run(tf.where(condition3,x,y)))
    print(sess.run(tf.where(condition4,x,y)))  
# Output:
# 1. [[ 1  8  9]
#     [10  5  6]]
# 2. [[ 1  8  9]
#     [ 4  5 12]]

(3) Compute the IoU between proposals and gt_boxes (after the previous step, gt_boxes contains only non-crowd boxes) and store it in overlaps.

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    ...  # code omitted
    
    # Compute overlaps matrix [proposals, gt_boxes]
    overlaps = overlaps_graph(proposals, gt_boxes)

(4) Compute the IoU between proposals and crowd_boxes and store it in crowd_overlaps.

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    ...  # code omitted
    
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
    crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
    no_crowd_bool = (crowd_iou_max < 0.001)

Here tf.reduce_max takes the maximum along a given axis. A usage example (taken from another blog post):

import tensorflow as tf
import numpy as np

a=np.array([[1, 2],
            [5, 3],
            [2, 6]])

b = tf.Variable(a)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(b))
    print('************')
    # For a 2-D matrix, axis=0 runs down the rows and axis=1 runs across the columns
    print(sess.run(tf.reduce_max(b, axis=1, keepdims=False)))  # keepdims=False: axis 1 is reduced away
    print('************')
    print(sess.run(tf.reduce_max(b, axis=1, keepdims=True)))
    print('************')
    print(sess.run(tf.reduce_max(b, axis=0, keepdims=True)))

Output:

[[1 2]
 [5 3]
 [2 6]]
************
[2 5 6]
************
[[2]
 [5]
 [6]]
************
[[5 6]]

(5) Decide positive/negative ROIs: (a) a positive ROI has a maximum IoU with the gt_boxes >= 0.5; (b) a negative ROI has a maximum IoU with the gt_boxes < 0.5 and does not overlap a crowd (crowd_iou_max < 0.001).

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    ...  # code omitted
    
    # Determine positive and negative ROIs
    roi_iou_max = tf.reduce_max(overlaps, axis=1)
    # 1. Positive ROIs are those with >= 0.5 IoU with a GT box
    positive_roi_bool = (roi_iou_max >= 0.5)
    positive_indices = tf.where(positive_roi_bool)[:, 0]
    # 2. Negative ROIs are those with < 0.5 with every GT box. Skip crowds.
    negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]
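
The positive/negative rule can be checked on a small made-up IoU matrix. This is a plain-Python sketch of the same thresholds; the function name split_rois and the sample numbers are illustrative assumptions, and it assumes at least one ground-truth box per image.

```python
def split_rois(overlaps, crowd_iou_max, iou_threshold=0.5, crowd_threshold=0.001):
    """Return (positive_indices, negative_indices) for an IoU matrix.

    overlaps[i][j] is the IoU of proposal i with GT box j;
    crowd_iou_max[i] is the largest IoU of proposal i with any crowd box.
    """
    positive, negative = [], []
    for i, row in enumerate(overlaps):
        roi_iou_max = max(row)
        if roi_iou_max >= iou_threshold:
            positive.append(i)
        elif crowd_iou_max[i] < crowd_threshold:
            # Low IoU with every GT box and no meaningful overlap with a crowd.
            negative.append(i)
    return positive, negative


overlaps = [
    [0.7, 0.1],   # strong match with GT box 0
    [0.2, 0.3],   # weak match everywhere
    [0.0, 0.55],  # strong match with GT box 1
    [0.4, 0.1],   # weak match, but sits on a crowd region
]
crowd_iou_max = [0.0, 0.0, 0.0, 0.3]

positive_indices, negative_indices = split_rois(overlaps, crowd_iou_max)
print(positive_indices, negative_indices)
```

Proposal 3 ends up in neither list: its IoU is too low to be positive, and its overlap with a crowd box keeps it out of the negatives.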

(6) Using the configured number of positives, control the positive/negative ratio and subsample the proposals to get positive_rois and negative_rois.

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    ...  # code omitted
    
    # Subsample ROIs. Aim for 33% positive
    # Positive ROIs
    positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
                         config.ROI_POSITIVE_RATIO)
    positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
    positive_count = tf.shape(positive_indices)[0]
    # Negative ROIs. Add enough to maintain positive:negative ratio.
    r = 1.0 / config.ROI_POSITIVE_RATIO
    negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
    negative_indices = tf.random_shuffle(negative_indices)[:negative_count]
    # Gather selected ROIs
    positive_rois = tf.gather(proposals, positive_indices)
    negative_rois = tf.gather(proposals, negative_indices)
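
A worked example of the counting logic may help. The sketch below reproduces the arithmetic in plain Python; the config values 200 and 0.33 are assumptions chosen for illustration, and Python floats are used where the source casts to float32, so edge cases could round differently.

```python
TRAIN_ROIS_PER_IMAGE = 200   # assumed config value
ROI_POSITIVE_RATIO = 0.33    # assumed config value


def sample_counts(num_positive_available, num_negative_available):
    """How many positive and negative ROIs survive the subsampling."""
    positive_cap = int(TRAIN_ROIS_PER_IMAGE * ROI_POSITIVE_RATIO)
    # Slicing [:positive_cap] keeps at most the cap, fewer if fewer exist.
    positive_count = min(num_positive_available, positive_cap)
    # Negatives are sized from the positives actually kept, not from the cap.
    r = 1.0 / ROI_POSITIVE_RATIO
    negative_cap = int(r * positive_count) - positive_count
    negative_count = min(num_negative_available, negative_cap)
    return positive_count, negative_count


print(sample_counts(10, 500))  # (10, 20)
print(sample_counts(20, 30))   # (20, 30)
print(sample_counts(0, 100))   # (0, 0)
```

The last case is worth noticing: with no positive ROIs, the ratio rule leaves no room for negatives either, so everything returned for that image is padding.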

(7) Assign positive ROIs to GT boxes
roi_gt_box_assignment holds, for each positive ROI, the index of the GT box with the largest value in positive_overlaps; this index is then used to get roi_gt_boxes and roi_gt_class_ids.

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    ...  # code omitted
    
    # Assign positive ROIs to GT boxes.
    positive_overlaps = tf.gather(overlaps, positive_indices)
    roi_gt_box_assignment = tf.cond(
        tf.greater(tf.shape(positive_overlaps)[1], 0),
        true_fn = lambda: tf.argmax(positive_overlaps, axis=1),
        false_fn = lambda: tf.cast(tf.constant([]),tf.int64)
    )
    roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
    roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)

(8) Compute the deltas between roi_gt_boxes and positive_rois (both are box coordinates).

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    # Assign positive ROIs to GT boxes.
    ...  # code omitted
    
    # Compute bbox refinement for positive ROIs
    deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes)
    deltas /= config.BBOX_STD_DEV
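
To see what the deltas look like numerically, here is a plain-Python sketch of the usual box refinement formula (center shift divided by box size, log of the size ratio), which matches the (dy, dx, log(dh), log(dw)) layout listed in the outputs. The body of utils.box_refinement_graph is not shown in this post, so treat this as an assumed equivalent; the BBOX_STD_DEV value is likewise an assumption.

```python
import math

BBOX_STD_DEV = (0.1, 0.1, 0.2, 0.2)  # assumed config value


def box_refinement(box, gt_box):
    """Deltas (dy, dx, dh, dw) that move `box` onto `gt_box`; both are (y1, x1, y2, x2)."""
    height = box[2] - box[0]
    width = box[3] - box[1]
    center_y = box[0] + 0.5 * height
    center_x = box[1] + 0.5 * width

    gt_height = gt_box[2] - gt_box[0]
    gt_width = gt_box[3] - gt_box[1]
    gt_center_y = gt_box[0] + 0.5 * gt_height
    gt_center_x = gt_box[1] + 0.5 * gt_width

    dy = (gt_center_y - center_y) / height
    dx = (gt_center_x - center_x) / width
    dh = math.log(gt_height / height)
    dw = math.log(gt_width / width)
    return (dy, dx, dh, dw)


def normalized_deltas(box, gt_box):
    # Divide each delta by its standard deviation, as the source does.
    return tuple(d / s for d, s in zip(box_refinement(box, gt_box), BBOX_STD_DEV))


roi = (0.0, 0.0, 0.25, 0.25)
gt = (0.0, 0.0, 0.5, 0.5)
print(box_refinement(roi, gt))
print(normalized_deltas(roi, gt))
```

An ROI that already coincides with its ground-truth box gets all-zero deltas; here the ground-truth box is twice as large, so the center shifts by half the ROI size and the log size ratio is log(2).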

(9) Assign positive ROIs to GT masks
Use roi_gt_box_assignment to pick the right roi_masks.

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    # Assign positive ROIs to GT boxes.
    # Compute bbox refinement for positive ROIs
    ...  # code omitted
    
    # Assign positive ROIs to GT masks
    # Permute masks to [N, height, width, 1]
    transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)
    # Pick the right mask for each ROI
    roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)

    # Compute mask targets
    boxes = positive_rois
    if config.USE_MINI_MASK:
        # Transform ROI coordinates from normalized image space
        # to normalized mini-mask space.
        # If enabled, resizes instance masks to a smaller size to reduce
        # memory load. Recommended when using high-resolution images.
        y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
        gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
        gt_h = gt_y2 - gt_y1
        gt_w = gt_x2 - gt_x1
        y1 = (y1 - gt_y1) / gt_h
        x1 = (x1 - gt_x1) / gt_w
        y2 = (y2 - gt_y1) / gt_h
        x2 = (x2 - gt_x1) / gt_w
        boxes = tf.concat([y1, x1, y2, x2], 1)
    box_ids = tf.range(0, tf.shape(roi_masks)[0])
    masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                     box_ids,
                                     config.MASK_SHAPE)
    # Remove the extra dimension from masks.
    masks = tf.squeeze(masks, axis=3)

    # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with
    # binary cross entropy loss.
    masks = tf.round(masks)
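
The USE_MINI_MASK branch is a change of coordinate frame: the ROI is re-expressed relative to its ground-truth box, so that the box itself becomes (0, 0, 1, 1). A plain-Python sketch of that transform follows; the function name to_mini_mask_space is an assumption made for this illustration.

```python
def to_mini_mask_space(roi, gt_box):
    """Express roi (y1, x1, y2, x2) relative to gt_box, so that gt_box maps to (0, 0, 1, 1)."""
    gt_y1, gt_x1, gt_y2, gt_x2 = gt_box
    gt_h = gt_y2 - gt_y1
    gt_w = gt_x2 - gt_x1
    y1, x1, y2, x2 = roi
    return ((y1 - gt_y1) / gt_h, (x1 - gt_x1) / gt_w,
            (y2 - gt_y1) / gt_h, (x2 - gt_x1) / gt_w)


gt_box = (0.25, 0.25, 0.75, 0.75)
print(to_mini_mask_space(gt_box, gt_box))                # (0.0, 0.0, 1.0, 1.0)
print(to_mini_mask_space((0.5, 0.5, 1.0, 1.0), gt_box))  # (0.5, 0.5, 1.5, 1.5)
```

The second ROI extends beyond its ground-truth box, so its transformed coordinates go above 1; the part of the crop that falls outside the stored mask has no mask content to sample.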

(10) Concatenate the positive and negative ROIs into rois, and zero pad roi_gt_class_ids, deltas, and masks.

def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    # Assign positive ROIs to GT boxes.
    # Compute bbox refinement for positive ROIs
    # Assign positive ROIs to GT masks
    # Compute mask targets
    ...  # code omitted
   
    # Append negative ROIs and pad bbox deltas and masks that
    # are not used for negative ROIs with zeros.
    rois = tf.concat([positive_rois, negative_rois], axis=0)
    N = tf.shape(negative_rois)[0]
    P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
    rois = tf.pad(rois, [(0, P), (0, 0)])
    roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
    roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
    deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
    masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)])

    return rois, roi_gt_class_ids, deltas, masks

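3 Indexing used in the code

It can be confusing to figure out how the indices move around in this code. One thing that helps is to make up some fake data, pull the indexing code out on its own, run it, and watch how the values change.

Here are two examples:

Example 1

Figure: indexing code, Example 1
To make this easier to follow, first read the usage examples for tf.where and tf.gather (in this blog post).
The source for Example 1 is part of the call method of class PyramidROIAlign(KE.Layer):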

# Use the four pyramid feature maps P2-P5, which fuse features from different depths
pooled = []
box_to_level = []  # box_to_level[i, 0] is the index of the image the feature belongs to; box_to_level[i, 1] is its box index
for i, level in enumerate(range(2, 6)):  # only the four feature maps P2-P5 are used
    # Find the ROIs assigned to this level
    # tf.where returns [coord1, coord2, ...]
    # np.where returns [[coord1.x, coord2.x, ...], [coord1.y, coord2.y, ...]]
    # Each row is (n, i): proposal i of image n (n indexes the batch axis, i the num_boxes axis)
    ix = tf.where(tf.equal(roi_level, level))  # ix is a set of coordinates; in the layer roi_level.shape=[batch,num_boxes], so each has two numbers
    # level_boxes holds the coordinates of every box assigned to this level
    # box_indices holds, for each of those boxes, the index of its image within the batch
    level_boxes = tf.gather_nd(boxes, ix)  # [number of proposals at this level, 4]

    # Box indices for crop_and_resize.
    box_indices = tf.cast(ix[:, 0], tf.int32)  # image index of each proposal
    # ^ ix[:, 0] is taken because tf.image.crop_and_resize expects it

    # Keep track of which box is mapped to which level
    box_to_level.append(ix)

This indexing is hard to follow, so let's make up some data and write a bit of code to see exactly what happens:

import numpy as np
import tensorflow as tf
# A small demonstration of slicing and indexing
# roi_level holds the pyramid level assigned to each ROI
# boxes holds the normalized coordinates of each ROI
def test():
    box_to_level = []

    # Assume batch=1 and num_boxes=5, and make up some data on that basis:
    roi_level = [[
        [4],
        [3],
        [3],
        [2],
        [5]
    ]]  # roi_level.shape=[batch,num_boxes,1]
    roi_level = np.array(roi_level)
    print('roi_level.shape=', roi_level.shape)
    boxes = [[
        [0.1, 0.3, 0.13, 0.34],
        [0.5, 0.66, 0.67, 0.89],
        [0.4, 0.61, 0.7, 0.8],
        [0.2, 0.3, 0.4, 0.5],
        [0.23, 0.13, 0.43, 0.54]
    ]]  # [batch, num_boxes, (y1, x1, y2, x2)]
    boxes = np.array(boxes)
    print('boxes.shape=', boxes.shape)
    
    # ------------ run the hard-to-follow code --------------
    for i, level in enumerate(range(2, 6)):
        ix = tf.where(tf.equal(roi_level, level))
        level_boxes = tf.gather_nd(boxes, ix)
        box_indices = tf.cast(ix[:, 0], tf.int32)
        print('i=',i,'  level=',level,'  ---------------')
        with tf.Session() as sess:
            print('ix:', sess.run(ix))
            print('level_boxes:', sess.run(level_boxes))
            print('box_indices:', sess.run(box_indices))
        box_to_level.append(ix)
    print("box_to_level:",)
    with tf.Session() as sess:
        for i in box_to_level:
            print(sess.run(i))


if __name__ == '__main__':
    test()

Output:

roi_level.shape= (1, 5, 1)
boxes.shape= (1, 5, 4)

roi_level = [[
        [4],
        [3],
        [3],
        [2],
        [5]
    ]]
boxes = [[
    [0.1, 0.3, 0.13, 0.34],
    [0.5, 0.66, 0.67, 0.89],
    [0.4, 0.61, 0.7, 0.8],
    [0.2, 0.3, 0.4, 0.5],
    [0.23, 0.13, 0.43, 0.54]
]]

# i= 0   level= 2   ---------------
ix: [[0 3 0]]
level_boxes: [0.2]
box_indices: [0]
# i= 1   level= 3   ---------------
ix: [[0 1 0]
 [0 2 0]]
level_boxes: [0.5 0.4]
box_indices: [0 0]
# i= 2   level= 4   ---------------
ix: [[0 0 0]]
level_boxes: [0.1]
box_indices: [0]
# i= 3   level= 5   ---------------
ix: [[0 4 0]]
level_boxes: [0.23]
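box_indices: [0]

Note that this demo keeps the trailing dimension of roi_level (shape [batch, num_boxes, 1]), so every row of ix has three numbers and tf.gather_nd(boxes, ix) returns a single scalar per match, namely boxes[b, n, 0], which is y1. In the layer itself roi_level has already been squeezed to [batch, num_boxes], so each row of ix has two numbers and level_boxes has shape [number of matches, 4].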
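
The same bookkeeping can be reproduced without TensorFlow. The sketch below is a plain-Python stand-in for tf.where and tf.gather_nd that uses a roi_level of shape [batch, num_boxes], so each index has two numbers and each selected box keeps all four coordinates. The function name boxes_for_level is an assumption made for this illustration.

```python
def boxes_for_level(roi_level, boxes, level):
    """Plain-Python stand-in for tf.where + tf.gather_nd on one pyramid level.

    roi_level: nested list [batch][num_boxes]; boxes: [batch][num_boxes][4].
    """
    # Like tf.where(tf.equal(roi_level, level)): one (batch_ix, box_ix) pair per match.
    ix = [(b, n)
          for b, levels in enumerate(roi_level)
          for n, value in enumerate(levels)
          if value == level]
    # Like tf.gather_nd(boxes, ix): a two-number index selects a whole box.
    level_boxes = [boxes[b][n] for b, n in ix]
    # Like ix[:, 0]: the image each selected box belongs to.
    box_indices = [b for b, _ in ix]
    return ix, level_boxes, box_indices


roi_level = [[4, 3, 3, 2, 5]]
boxes = [[
    [0.1, 0.3, 0.13, 0.34],
    [0.5, 0.66, 0.67, 0.89],
    [0.4, 0.61, 0.7, 0.8],
    [0.2, 0.3, 0.4, 0.5],
    [0.23, 0.13, 0.43, 0.54],
]]

for level in range(2, 6):
    print(level, boxes_for_level(roi_level, boxes, level))
```

For level 3 this selects boxes 1 and 2 of image 0, each with its full (y1, x1, y2, x2) row, which is the [number of proposals at this level, 4] shape that crop_and_resize needs.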

Example 2

The source is the first few lines of refine_detections_graph in the Detection Layer. The Mask R-CNN code is:

# ----------- get the score of the top-scoring class for each proposed region -----------
    # Class IDs per ROI
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)  # [N], top-scoring class of each ROI
    # Class probability of the top class of each ROI
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)  # [N, (ROI index, top class index)]
    class_scores = tf.gather_nd(probs, indices)  # [N], score of the top class of each ROI

Again, make up some data and write a bit of code to see exactly what happens:

import numpy as np
import tensorflow as tf
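# A small demonstration of slicing and indexing
# For one image, probs.shape=(N, num_class)
# N is the number of objects detected in the image; this example assumes N=6
# num_class is the number of classes in the training data; this example uses 9 columns:
# column 0, which is the background class in Mask R-CNN, plus 8 object classes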
def test():
    probs = np.array([
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.3],
        [0, 0.5, 0.2, 0.3, 0.4, 0.1, 0.6, 0.2, 0.3],
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.3],
        [0, 0.9, 0.2, 0.3, 0.4, 0.5, 0.6, 0.4, 0.3],
        [0, 0.1, 0.2, 0.9, 0.4, 0.5, 0.6, 0.2, 0.3],
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.6, 0.3],
    ])
    print(probs)
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)  # [N], top-scoring class of each ROI
    # Class probability of the top class of each ROI
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)  # [N, (ROI index, top class index)]
    class_scores = tf.gather_nd(probs, indices)  # [N], score of the top class of each ROI

    with tf.Session() as sess:
        print(sess.run(indices))
        print(sess.run(class_scores))


if __name__ == '__main__':
    test()

The output is:

# probs = 
[[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.3]
 [0.  0.5 0.2 0.3 0.4 0.1 0.6 0.2 0.3]
 [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.1 0.3]
 [0.  0.9 0.2 0.3 0.4 0.5 0.6 0.4 0.3]
 [0.  0.1 0.2 0.9 0.4 0.5 0.6 0.2 0.3]
 [0.  0.1 0.2 0.3 0.4 0.5 0.9 0.6 0.3]]
 # class_ids = 
 [7 6 6 1 3 6]
 # indices = 
 [[0 7]
 [1 6]
 [2 6]
 [3 1]
 [4 3]
 [5 6]]
 # class_scores = 
 [0.7 0.6 0.6 0.9 0.9 0.9]

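Hopefully this makes the following clearer:

How tf.gather_nd is used and how it differs from tf.gather
How indices is built and what it is for

The thinking behind the code:
First be clear about the variable shapes and the goal: we want the highest score in each row of probs. probs is a 2-D tensor, so each index must also have two components (row, column), and tf.stack builds those. The column index of the largest value in each row comes from tf.argmax.

With indices in hand, tf.gather_nd fetches the actual values, and the goal is reached.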
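
As a cross-check, the same three steps can be written in plain Python, with list operations standing in for tf.argmax, tf.stack, and tf.gather_nd. The function name top_class_scores is an assumption made for this sketch; the data is the probs matrix from Example 2.

```python
def top_class_scores(probs):
    """For each row of probs, return (class_ids, indices, class_scores)."""
    # Like tf.argmax(probs, axis=1): column of the largest value in each row.
    class_ids = [max(range(len(row)), key=lambda c: row[c]) for row in probs]
    # Like tf.stack([tf.range(N), class_ids], axis=1): one (row, column) pair per ROI.
    indices = [[r, c] for r, c in enumerate(class_ids)]
    # Like tf.gather_nd(probs, indices): pick probs[row][column] for each pair.
    class_scores = [probs[r][c] for r, c in indices]
    return class_ids, indices, class_scores


probs = [
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.3],
    [0, 0.5, 0.2, 0.3, 0.4, 0.1, 0.6, 0.2, 0.3],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.3],
    [0, 0.9, 0.2, 0.3, 0.4, 0.5, 0.6, 0.4, 0.3],
    [0, 0.1, 0.2, 0.9, 0.4, 0.5, 0.6, 0.2, 0.3],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.6, 0.3],
]

class_ids, indices, class_scores = top_class_scores(probs)
print(class_ids)
print(indices)
print(class_scores)
```

By contrast, a tf.gather(probs, class_ids) would treat each class id as a row number and return whole rows of probs, which is why the (row, column) pairs and tf.gather_nd are needed here.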

Next: Mask R-CNN: A Very Detailed Code Walkthrough (Part 3)
