Mask R-CNN Paper Reading Notes

Tiám青年

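This post is on the long side, but a patient read should pay off. Corrections and discussion are welcome. Also attached: the paper download link.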

1. Abstract of the Paper

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

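In short, the authors propose Mask R-CNN, a conceptually simple, flexible, and general framework for object instance segmentation. It efficiently detects the objects in an image while generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch that predicts object masks in parallel with the existing bounding-box recognition branch. Mask R-CNN is easy to train, runs at 5 fps, and adds only a small overhead to Faster R-CNN. It also generalizes easily to other tasks, such as estimating human poses within the same framework. The model achieves top results in all three tracks of the COCO challenges: instance segmentation, bounding-box object detection, and person keypoint detection.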

2. Network Architecture

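Mask R-CNN is built by extending Faster R-CNN, and the key addition is the mask branch. The mask branch is a small fully convolutional network applied to each RoI, in parallel with the classification and bounding-box regression branches. It predicts a high-quality segmentation mask for each RoI in a pixel-to-pixel manner while adding only a small computational cost.

[Figure: the Mask R-CNN framework for instance segmentation, slightly modified from the authors' original figure.]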

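The advantages of Mask R-CNN are that it is easy to implement and train on top of the Faster R-CNN framework, it runs fast, and it makes experimentation convenient. The framework is analyzed in detail below.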

2.1 Faster R-CNN

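Faster R-CNN consists of two stages. The first stage, called the Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage is in essence Fast R-CNN: it uses RoIPool to extract features from each candidate box and performs classification and bounding-box regression. The two stages can share features to speed up inference.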

2.2 Mask R-CNN

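Mask R-CNN adopts the same two-stage procedure as Faster R-CNN. The first stage is the same Region Proposal Network (RPN), which produces candidate object boxes. The second stage, in parallel with predicting the class and the box, also outputs a binary mask for each RoI. Training uses a multi-task loss:

$L = L_{cls} + L_{box} + L_{mask}$

Here $L_{cls}$ and $L_{box}$ are the classification loss and the bounding-box loss, and $L_{mask}$ is the mask loss. The mask branch produces a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. $L_{mask}$ is the average binary cross-entropy loss, defined with a per-pixel sigmoid. If an RoI is associated with ground-truth class $k$, then $L_{mask}$ is defined only on the $k$-th mask and is unaffected by the other mask outputs. This definition is very important for instance segmentation quality: it decouples mask prediction from class prediction, so Mask R-CNN can generate a mask for every class independently, without competition among classes. In contrast, fully convolutional networks usually classify each pixel into one of several classes, which couples segmentation and classification and does not work well for instance segmentation.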
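The per-class definition of the mask loss described above can be illustrated with a small sketch. It uses plain Python lists instead of tensors, and the function name `mask_loss_for_roi`, the toy logits, and the made-up values for the classification and box losses are all assumptions for illustration only; they are not taken from the paper or from the code in this post.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single raw score."""
    return 1.0 / (1.0 + math.exp(-x))


def mask_loss_for_roi(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class's mask.

    mask_logits: K masks, each an m x m nested list of raw scores.
    gt_mask:     m x m nested list of 0/1 targets.
    gt_class:    index k of the RoI's ground-truth class.
    """
    logits_k = mask_logits[gt_class]  # the other K-1 masks are ignored
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits_k, gt_mask):
        for z, t in zip(row_logits, row_targets):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


# Toy RoI with K = 2 classes and m = 2 (hypothetical numbers).
mask_logits = [
    [[5.0, -5.0], [5.0, -5.0]],  # class 0 mask (not the ground-truth class)
    [[2.0, -2.0], [2.0, -2.0]],  # class 1 mask (ground-truth class)
]
gt_mask = [[1, 0], [1, 0]]

loss = mask_loss_for_roi(mask_logits, gt_mask, gt_class=1)

# Multi-task total with made-up classification and box losses.
total_loss = 0.3 + 0.2 + loss
print(loss, total_loss)
```

Because only the mask of class $k$ enters the loss, changing the logits of any other class leaves $L_{mask}$ unchanged, which is the decoupling the text refers to.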

Core code of the loss functions:

############################################################
#  Loss Functions
############################################################

import tensorflow as tf
import keras.backend as K  # Keras backend, aliased K as the code below assumes

def smooth_l1_loss(y_true, y_pred):
    """Implements the Smooth-L1 loss, element-wise.

    Quadratic (0.5 * d^2) where |d| < 1, linear (|d| - 0.5) elsewhere.
    """
    diff = K.abs(y_true - y_pred)
    less_than_one = K.cast(K.less(diff, 1.0), "float32")
    loss = (less_than_one * 0.5 * diff**2) + (1 - less_than_one) * (diff - 0.5)
    return loss


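def rpn_class_loss_graph(rpn_match, rpn_class_logits):
    """RPN anchor classifier loss.
    """
    # Squeeze the last dimension to simplify.
    rpn_match = tf.squeeze(rpn_match, -1)
    # Get anchor classes: convert the -1/+1 match values to 0/1.
    anchor_class = K.cast(K.equal(rpn_match, 1), tf.int32)
    # Positive and negative anchors contribute to the loss;
    # neutral anchors (match value 0) do not.
    indices = tf.where(K.not_equal(rpn_match, 0))
    # Pick the rows that contribute to the loss and filter out the rest.
    rpn_class_logits = tf.gather_nd(rpn_class_logits, indices)
    anchor_class = tf.gather_nd(anchor_class, indices)
    # Cross-entropy loss
    loss = K.sparse_categorical_crossentropy(target=anchor_class,
                                             output=rpn_class_logits,
                                             from_logits=True)
    # Return 0 if no anchor contributes, to avoid the mean of an empty tensor.
    loss = K.switch(tf.size(loss) > 0, K.mean(loss), tf.constant(0.0))
    return loss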


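def rpn_bbox_loss_graph(config, target_bbox, rpn_match, rpn_bbox):
    """
    Returns the RPN bounding-box loss.
    """
    # Positive anchors contribute to the loss; negative and neutral ones do not.
    rpn_match = K.squeeze(rpn_match, -1)
    indices = tf.where(K.equal(rpn_match, 1))

    # Pick the predicted box deltas that contribute to the loss.
    rpn_bbox = tf.gather_nd(rpn_bbox, indices)

    # Trim the target box deltas to the same length as rpn_bbox.
    # batch_pack_graph is a helper defined elsewhere in the source; it is not shown here.
    batch_counts = K.sum(K.cast(K.equal(rpn_match, 1), tf.int32), axis=1)
    target_bbox = batch_pack_graph(target_bbox, batch_counts,
                                   config.IMAGES_PER_GPU)

    loss = smooth_l1_loss(target_bbox, rpn_bbox)

    # Return 0 if there are no positive anchors.
    loss = K.switch(tf.size(loss) > 0, K.mean(loss), tf.constant(0.0))
    return loss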


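def mrcnn_class_loss_graph(target_class_ids, pred_class_logits,
                           active_class_ids):
    """Loss for the classifier head of Mask R-CNN.
    """
    target_class_ids = tf.cast(target_class_ids, 'int64')

    # Find predictions of classes that are not in the dataset.
    # Note: indexing active_class_ids[0] assumes every image in the batch
    # shares the same set of active classes.
    pred_class_ids = tf.argmax(pred_class_logits, axis=2)
    pred_active = tf.gather(active_class_ids[0], pred_class_ids)

    # Loss
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=target_class_ids, logits=pred_class_logits)

    # Erase the losses of predictions for classes that are not in the
    # active classes of the image.
    loss = loss * pred_active

    # Compute the loss mean. Use only predictions that contribute
    # to the loss to get a correct mean.
    loss = tf.reduce_sum(loss) / tf.reduce_sum(pred_active)
    return loss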


def mrcnn_bbox_loss_graph(target_bbox, target_class_ids, pred_bbox):
    """
    Loss for the Mask R-CNN bounding-box refinement.
    """
    # Reshape to merge the batch and RoI dimensions for simplicity.
    target_class_ids = K.reshape(target_class_ids, (-1,))
    target_bbox = K.reshape(target_bbox, (-1, 4))
    pred_bbox = K.reshape(pred_bbox, (-1, K.int_shape(pred_bbox)[2], 4))

    # Only positive RoIs contribute to the loss, and only the box
    # predicted for the ground-truth class of each RoI is used.
    positive_roi_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_roi_class_ids = tf.cast(
        tf.gather(target_class_ids, positive_roi_ix), tf.int64)
    indices = tf.stack([positive_roi_ix, positive_roi_class_ids], axis=1)

    # Gather the deltas (predicted and true) that contribute to the loss.
    target_bbox = tf.gather(target_bbox, positive_roi_ix)
    pred_bbox = tf.gather_nd(pred_bbox, indices)

    # Smooth-L1 loss; returns 0 if there are no positive RoIs.
    loss = K.switch(tf.size(target_bbox) > 0,
                    smooth_l1_loss(y_true=target_bbox, y_pred=pred_bbox),
                    tf.constant(0.0))
    loss = K.mean(loss)
    return loss


def mrcnn_mask_loss_graph(target_masks, target_class_ids, pred_masks):
    """
    Binary cross-entropy loss for the mask head.
    """
    # Reshape for simplicity. Merge the first two dimensions into one.
    target_class_ids = K.reshape(target_class_ids, (-1,))
    mask_shape = tf.shape(target_masks)
    target_masks = K.reshape(target_masks, (-1, mask_shape[2], mask_shape[3]))
    pred_shape = tf.shape(pred_masks)
    pred_masks = K.reshape(pred_masks,
                           (-1, pred_shape[2], pred_shape[3], pred_shape[4]))
    # Permute the predicted masks to [N, num_classes, height, width].
    pred_masks = tf.transpose(pred_masks, [0, 3, 1, 2])

    # Only positive RoIs contribute to the loss, and only the
    # class-specific mask of each RoI is used.
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_class_ids = tf.cast(
        tf.gather(target_class_ids, positive_ix), tf.int64)
    indices = tf.stack([positive_ix, positive_class_ids], axis=1)

    # Gather the masks (predicted and true) that contribute to the loss.
    y_true = tf.gather(target_masks, positive_ix)
    y_pred = tf.gather_nd(pred_masks, indices)

    # Compute binary cross-entropy. If there are no positive RoIs, return 0.
    # The element-wise loss has shape [num_positive_rois, height, width].
    loss = K.switch(tf.size(y_true) > 0,
                    K.binary_crossentropy(target=y_true, output=y_pred),
                    tf.constant(0.0))
    loss = K.mean(loss)
    return loss
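To make the box losses above easier to check by hand, here is a minimal pure-Python sketch of the same idea: Smooth-L1 is computed element-wise, and only positive anchors (match value 1) contribute to the mean. It is a simplification under stated assumptions: the names `smooth_l1` and `rpn_bbox_loss` and the toy numbers are hypothetical, and the targets are assumed to be aligned one-to-one with the anchors rather than packed per image as in the listing above.

```python
def smooth_l1(y_true, y_pred):
    """Element-wise Smooth-L1: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    losses = []
    for t, p in zip(y_true, y_pred):
        d = abs(t - p)
        losses.append(0.5 * d * d if d < 1.0 else d - 0.5)
    return losses


def rpn_bbox_loss(target_deltas, pred_deltas, rpn_match):
    """Mean Smooth-L1 over the coordinates of positive anchors only.

    rpn_match uses 1 for positive, -1 for negative, 0 for neutral anchors.
    Assumption: target_deltas[i] belongs to anchor i (no per-image packing).
    """
    positives = [i for i, m in enumerate(rpn_match) if m == 1]
    if not positives:
        return 0.0  # mirrors the "return 0 if nothing contributes" guard
    per_coord = []
    for i in positives:
        per_coord.extend(smooth_l1(target_deltas[i], pred_deltas[i]))
    return sum(per_coord) / len(per_coord)


# Four anchors: positive, negative, neutral, positive (toy values).
rpn_match = [1, -1, 0, 1]
target = [[0.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0],
          [9.0, 9.0, 9.0, 9.0], [1.0, 1.0, 1.0, 1.0]]
pred = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0], [3.0, 1.0, 1.0, 1.0]]

box_loss = rpn_bbox_loss(target, pred, rpn_match)
print(box_loss)
```

The negative and neutral anchors have large errors in this toy data, yet they do not affect the result, because they are filtered out before the loss is averaged.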

2.3 Mask Representation

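Mask R-CNN uses a mask to encode the spatial layout of an input object. Extracting this layout differs from extracting class labels or box attributes: the former can be handled naturally through the pixel-to-pixel correspondence provided by convolutions, while the latter are inevitably collapsed into short output vectors by fully connected layers.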

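Specifically, Mask R-CNN predicts an $m \times m$ mask for each RoI with a fully convolutional network. This lets every layer of the mask branch keep an explicit $m \times m$ object layout, without collapsing it into a vector representation that has no spatial dimensions. Compared with approaches that use fully connected layers, this requires fewer parameters and is more accurate.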
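A rough parameter count shows why a convolutional mask head can need far fewer parameters than a fully connected one. The sizes below (a 14 x 14 x 256 RoI feature, a 28 x 28 mask, a 3 x 3 convolution) are hypothetical values chosen for illustration; this post does not specify the exact layers of the mask head, so this is arithmetic under stated assumptions rather than the paper's configuration.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Hypothetical sizes, for illustration only.
m_in, channels, m_out = 14, 256, 28

# One 3x3 convolution that keeps 256 channels and the spatial layout.
conv_layer = conv_params(3, channels, channels)

# One fully connected layer mapping the flattened RoI feature
# to a single flattened m_out x m_out mask.
fc_layer = fc_params(m_in * m_in * channels, m_out * m_out)

print(conv_layer, fc_layer, fc_layer / conv_layer)
```

The convolution's parameter count does not depend on the spatial size of the feature map, whereas the fully connected layer's count grows with both the input and the output resolution.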

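In fact, to preserve spatial locations more faithfully and obtain better pixel-level mask predictions, Mask R-CNN also uses a layer that extends region-of-interest pooling (RoIPool), called RoIAlign.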
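This post does not spell out how RoIAlign differs from RoIPool. As described in the paper, RoIPool quantizes real-valued RoI coordinates to integer cells, while RoIAlign avoids quantization and reads feature values with bilinear interpolation. The sketch below contrasts the two on a single sampling point. The helper names and the toy feature map are assumptions for illustration; a full RoIAlign layer samples several points per output bin and aggregates them, which is omitted here.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (nested lists) at real-valued (y, x).

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    h, w = len(feature), len(feature[0])
    y0 = min(max(int(math.floor(y)), 0), h - 1)
    x0 = min(max(int(math.floor(x)), 0), w - 1)
    y1 = min(y0 + 1, h - 1)
    x1 = min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """RoIPool-style lookup: coordinates are rounded down to an integer cell first."""
    return feature[int(math.floor(y))][int(math.floor(x))]


# Toy feature map whose value is 10 * row + column.
feature = [[0.0, 1.0, 2.0],
           [10.0, 11.0, 12.0],
           [20.0, 21.0, 22.0]]

aligned = bilinear_sample(feature, 0.5, 1.5)
pooled = quantized_sample(feature, 0.5, 1.5)
print(aligned, pooled)
```

On this map the interpolated value reflects the fractional position, while the quantized lookup snaps to the top-left cell and loses the half-cell offset, which is the kind of misalignment that hurts pixel-level masks.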

Main code of the MaskRCNN class:

############################################################
#  MaskRCNN Class
############################################################

import keras.layers as KL  # Keras layers, aliased KL as the code below assumes

class MaskRCNN():
    """Encapsulates the Mask RCNN model functionality.

    The actual Keras model is in the keras_model property.
    """

    def __init__(self, mode, config, model_dir):
        """
        mode: Either "training" or "inference"
        config: A Sub-class of the Config class
        model_dir: Directory to save training logs and trained weights
        """
        assert mode in ['training', 'inference']
        self.mode = mode
        self.config = config
        self.model_dir = model_dir
        self.set_log_dir()
        self.keras_model = self.build(mode=mode, config=config)

    def build(self, mode, config):
        """Build Mask R-CNN architecture.
            input_shape: The shape of the input image.
            mode: Either "training" or "inference". The inputs and
                outputs of the model differ accordingly.
        """
        assert mode in ['training', 'inference']

        # Image size must be dividable by 2 multiple times
        h, w = config.IMAGE_SHAPE[:2]
        if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6):
            raise Exception("Image size must be dividable by 2 at least 6 times "
                            "to avoid fractions when downscaling and upscaling."
                            "For example, use 256, 320, 384, 448, 512, ... etc. ")

        # Inputs
        input_image = KL.Input(
            shape=[None, None, config.IMAGE_SHAPE[2]], name="input_image")
        input_image_meta = KL.Input(shape=[config.IMAGE_META_SIZE],
                                    name="input_image_meta")
        if mode == "training":
            # RPN GT
            input_rpn_match = KL.Input(
                shape=[None, 1], name="input_rpn_match", dtype=tf.int32)
            input_rpn_bbox = KL.Input(
                shape=[None, 4], name="input_rpn_bbox", dtype=tf.float32)

            # Detection GT (class IDs, bounding boxes, and masks)
            # 1. GT Class IDs (zero padded)
            input_gt_class_ids = KL.Input(
                shape=[None], name="input_gt_class_ids", dtype=tf.int32)
            # 2. GT Boxes in pixels (zero padded)
            # [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] in image coordinates
            input_gt_boxes = KL.Input(
                shape=[N
