Mask R-CNN Code Walkthrough and Study Notes

Latest recommended article published 2023-08-23 15:34:52

墨水兰亭


Article link: https://blog.csdn.net/moshuilangting/article/details/90774719


GitHub code: https://github.com/matterport/Mask_RCNN

We start with coco.py:

1. Loading the data, parameters, and model:

Setting up the configuration:

    args = parser.parse_args()
    ......
    # Configurations
    if args.command == "train":
        config = CocoConfig()
    else:
        class InferenceConfig(CocoConfig):
            # Set batch size to 1 since we'll be running inference on
            # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
            GPU_COUNT = 1
            IMAGES_PER_GPU = 1
            DETECTION_MIN_CONFIDENCE = 0
        config = InferenceConfig()
    config.display()

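This reads in the config parameters. The CocoConfig class inherits from the Config class in mrcnn/config.py.
Training and inference use different configs, so inference overrides a few of the parameters.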
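
The override pattern can be sketched without the library. The classes below are hypothetical stand-ins, not the real mrcnn.config.Config, and the default values are illustrative assumptions. The point is only that a subclass overrides class attributes and the derived batch size is recomputed from them.

```python
class BaseConfigSketch:
    """Hypothetical stand-in for a Config-style base class (illustration only)."""
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2              # assumed default for this sketch
    DETECTION_MIN_CONFIDENCE = 0.7  # assumed default for this sketch

    def __init__(self):
        # Derived value, computed from whatever the subclass overrides.
        self.BATCH_SIZE = self.GPU_COUNT * self.IMAGES_PER_GPU


class InferenceConfigSketch(BaseConfigSketch):
    # Only class attributes are overridden; __init__ picks them up.
    IMAGES_PER_GPU = 1
    DETECTION_MIN_CONFIDENCE = 0


train_cfg = BaseConfigSketch()
infer_cfg = InferenceConfigSketch()
print(train_cfg.BATCH_SIZE, infer_cfg.BATCH_SIZE)  # 2 1
```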

Loading the training and validation datasets:

    # Train or evaluate
    if args.command == "train":
        # Training dataset. Use the training set and 35K from the
        # validation set, as in the Mask RCNN paper.
        dataset_train = CocoDataset()
        dataset_train.load_coco(args.dataset, "train", year=args.year, auto_download=args.download)
        if args.year in '2014':
            dataset_train.load_coco(args.dataset, "valminusminival", year=args.year, auto_download=args.download)
        dataset_train.prepare()

        # Validation dataset
        dataset_val = CocoDataset()
        val_type = "val" if args.year in '2017' else "minival"
        dataset_val.load_coco(args.dataset, val_type, year=args.year, auto_download=args.download)
        dataset_val.prepare()

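The training and validation data are read in here. The load_coco function of the CocoDataset class reads the annotations from the .json file into a list of dictionaries, list[{}, {}].
The prepare method is then called to normalize the data for later training.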
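
To make the "list of dictionaries" concrete, here is a simplified sketch of that bookkeeping. TinyDataset is a hypothetical class written for this post; it only mirrors the idea and is not the repo's Dataset implementation, and the file name and IDs are made up.

```python
class TinyDataset:
    """Hypothetical, simplified dataset container (not the real API)."""

    def __init__(self):
        self.image_info = []  # one dict per image
        self.class_info = [{"source": "", "id": 0, "name": "BG"}]  # background first

    def add_class(self, source, class_id, name):
        self.class_info.append({"source": source, "id": class_id, "name": name})

    def add_image(self, source, image_id, path, **kwargs):
        info = {"id": image_id, "source": source, "path": path}
        info.update(kwargs)  # e.g. width, height, annotations
        self.image_info.append(info)

    def prepare(self):
        # Assign contiguous internal IDs so later code can index by position.
        self.num_classes = len(self.class_info)
        self.class_ids = list(range(self.num_classes))
        self.class_names = [c["name"] for c in self.class_info]
        self.image_ids = list(range(len(self.image_info)))


ds = TinyDataset()
ds.add_class("coco", 1, "person")
ds.add_image("coco", 42, "images/a.jpg", width=640, height=480, annotations=[])
ds.prepare()
print(ds.class_names)    # ['BG', 'person']
print(ds.image_info[0])  # the dict stored for the first image
```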

Creating the model:

    # Create model
    if args.command == "train":
        model = modellib.MaskRCNN(mode="training", config=config,
                                  model_dir=args.logs)
    else:
        model = modellib.MaskRCNN(mode="inference", config=config,
                                  model_dir=args.logs)

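2. The Mask R-CNN model:

The overall model is shown in the figure. Figure source: https://www.cnblogs.com/hellcat/p/9802349.html

Look at model.py in the mrcnn folder and find the MaskRCNN class.

a. First, the initialization in __init__ and the model inputs (all shapes below omit the batch dimension).

Training mode:

1. input_image: the input image, [None, None, config.IMAGE_SHAPE[2]]

2. input_image_meta: the image attributes (explained in the code below)

3. The RPN ground truth, input_rpn_match [None, 1] and input_rpn_bbox [None, 4]; these labels are used in the loss

4. The detection ground truth, input_gt_class_ids [None] and input_gt_boxes [None, 4], plus the mask labels input_gt_masks [config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None]

Inference mode:

1. input_image: the input image, [None, None, config.IMAGE_SHAPE[2]]

2. input_image_meta: the image attributes (explained in the code below)

3. input_anchors [None, 4]: the preset anchors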

class MaskRCNN():
    """Encapsulates the Mask RCNN model functionality.

    The actual Keras model is in the keras_model property.
    """

    def __init__(self, mode, config, model_dir):
        """
        mode: Either "training" or "inference"
        config: A Sub-class of the Config class
        model_dir: Directory to save training logs and trained weights
        """
        assert mode in ['training', 'inference']
        self.mode = mode
        self.config = config
        self.model_dir = model_dir
        self.set_log_dir()
        self.keras_model = self.build(mode=mode, config=config)

    def build(self, mode, config):
        """Build Mask R-CNN architecture.
            input_shape: The shape of the input image.
            mode: Either "training" or "inference". The inputs and
                outputs of the model differ accordingly.
        """
        assert mode in ['training', 'inference']

        # Image size must be dividable by 2 multiple times

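        # Requires the resized image height and width to be multiples of 2^6 = 64,
        # so that repeated downsampling and upsampling never produces fractional sizes.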
        h, w = config.IMAGE_SHAPE[:2]
        if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6):
            raise Exception("Image size must be dividable by 2 at least 6 times "
                            "to avoid fractions when downscaling and upscaling."
                            "For example, use 256, 320, 384, 448, 512, ... etc. ")

        # Inputs: the input image
        input_image = KL.Input(
            shape=[None, None, config.IMAGE_SHAPE[2]], name="input_image")
        """
        IMAGE_META_SIZE is set in config; it is the length of the image attribute tensor:
        self.IMAGE_META_SIZE = 1 (image_id) + 3 (original_image_shape) +
        3 (image_shape) + 4 ((y1, x1, y2, x2) window of image in pixels) +
        1 (scale) + self.NUM_CLASSES (active_class_ids)
        """
        input_image_meta = KL.Input(shape=[config.IMAGE_META_SIZE],
                                    name="input_image_meta")

        if mode == "training":
            # RPN GT
            input_rpn_match = KL.Input(
                shape=[None, 1], name="input_rpn_match", dtype=tf.int32)
            input_rpn_bbox = KL.Input(
                shape=[None, 4], name="input_rpn_bbox", dtype=tf.float32)

            # Detection GT (class IDs, bounding boxes, and masks)
            # 1. GT Class IDs (zero padded)
            input_gt_class_ids = KL.Input(
                shape=[None], name="input_gt_class_ids", dtype=tf.int32)
            # 2. GT Boxes in pixels (zero padded)
            # [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] in image coordinates
            input_gt_boxes = KL.Input(
                shape=[None, 4], name="input_gt_boxes", dtype=tf.float32)
            # Normalize coordinates
            gt_boxes = KL.Lambda(lambda x: norm_boxes_graph(
                x, K.shape(input_image)[1:3]))(input_gt_boxes)
            # 3. GT Masks (zero padded)
            # [batch, height, width, MAX_GT_INSTANCES]
            if config.USE_MINI_MASK:
                input_gt_masks = KL.Input(
                    shape=[config.MINI_MASK_SHAPE[0],
                           config.MINI_MASK_SHAPE[1], None],
                    name="input_gt_masks", dtype=bool)
            else:
                input_gt_masks = KL.Input(
                    shape=[config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None],
                    name="input_gt_masks", dtype=bool)
        elif mode == "inference":
            # Anchors in normalized coordinates
            input_anchors = KL.Input(shape=[None, 4], name="input_anchors")
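
As a quick check of the two constraints above, the sketch below packs an image-meta vector in the field order given in the comment and tests the divisible-by-64 rule. The helper names and the example values (a 480x640 image resized and padded to 1024x1024, 81 classes) are assumptions made for illustration.

```python
import numpy as np


def compose_meta_sketch(image_id, original_shape, image_shape, window, scale,
                        active_class_ids):
    """Pack image attributes into one flat vector (hypothetical helper)."""
    return np.array(
        [image_id]
        + list(original_shape)      # 3 values
        + list(image_shape)         # 3 values
        + list(window)              # 4 values: (y1, x1, y2, x2) in pixels
        + [scale]                   # 1 value
        + list(active_class_ids),   # NUM_CLASSES values
        dtype=np.float64)


def divisible_by_64(h, w):
    """True if both sides survive six halvings without fractions."""
    return h % 2**6 == 0 and w % 2**6 == 0


num_classes = 81  # assumed: 80 object classes + background
meta = compose_meta_sketch(
    image_id=0,
    original_shape=(480, 640, 3),
    image_shape=(1024, 1024, 3),
    window=(128, 0, 896, 1024),  # where the resized image sits in the padding
    scale=1.6,
    active_class_ids=np.ones(num_classes))

print(meta.shape)                   # (93,) = 1 + 3 + 3 + 4 + 1 + 81
print(divisible_by_64(1024, 1024))  # True
print(divisible_by_64(1000, 1024))  # False
```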

b. Network architecture:

It consists of a bottom-up residual network and a top-down feature pyramid.

        # Build the shared convolutional layers.
        # Bottom-up Layers: the bottom-up ResNet
        # Returns a list of the last layers of each stage, 5 in total.
        # Don't create the head (stage 5), so we pick the 4th item in the list.
        if callable(config.BACKBONE):
            _, C2, C3, C4, C5 = config.BACKBONE(input_image, stage5=True,
                                                train_bn=config.TRAIN_BN)
        else:
            # resnet_graph builds the residual network here
            _, C2, C3, C4, C5 = resnet_graph(input_image, config.BACKBONE,
                                             stage5=True, train_bn=config.TRAIN_BN)
        # Top-down Layers: build the feature pyramid from the top feature level down
        # TODO: add assert to verify feature map sizes match what's in config
        P5 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
        P4 = KL.Add(name="fpn_p4add")([
            KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
            KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
        P3 = KL.Add(name="fpn_p3add")([
            KL.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
            KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
        P2 = KL.Add(name="fpn_p2add")([
            KL.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
            KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])
        # Attach 3x3 conv to all P layers to get the final feature maps.
        P2 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
        P3 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
        P4 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
        P5 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
        # P6 is used for the 5th anchor scale in RPN. Generated by
        # subsampling from P5 with stride of 2.
        P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)

        # Note that P6 is used in RPN, but not in the classifier heads.
        rpn_feature_maps = [P2, P3, P4, P5, P6]
        mrcnn_feature_maps = [P2, P3, P4, P5]

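See the resnet_graph function for the details of the residual network.

ResNet (each line ends with the author's running layer count):

stage1: input -> ZeroPadding2D -> conv -> BN -> relu -> pool -> C1. Count: 1 (ZeroPadding2D) + 1 = 2

stage2: C1 -> conv_block -> identity_block -> identity_block -> C2. Count: 2 + 3 + 3 + 3 = 11

stage3: C2 -> conv_block -> identity_block -> identity_block -> identity_block -> C3. Count: 11 + 3 + 3 + 3 + 3 = 23

stage4: C3 -> conv_block -> 22 identity_blocks (resnet101) or 5 identity_blocks (resnet50) -> C4. Count: 23 + 3 + 3×22 = 92, or 23 + 3 + 3×5 = 41

stage5: C4 -> conv_block -> identity_block -> identity_block -> C5. Count: 92 + 3 + 3 + 3 = 101, or 41 + 3 + 3 + 3 = 50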
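
The running totals above reach 50 and 101 by counting ZeroPadding2D as a layer. The more common convention counts one stem convolution, three convolutions per bottleneck block, and one final fully connected layer (which a feature-extraction backbone like this one has no need to build). A small sketch of that arithmetic, with a made-up helper name:

```python
def resnet_depth(blocks_per_stage):
    """Conventional depth: 1 stem conv + 3 convs per bottleneck block + 1 FC."""
    return 1 + 3 * sum(blocks_per_stage) + 1


# Blocks in stages 2 to 5 (conv_block plus identity_blocks per stage).
print(resnet_depth([3, 4, 6, 3]))   # 50
print(resnet_depth([3, 4, 23, 3]))  # 101
```

Both conventions agree on the per-stage block counts; they differ only in which single extra layer is added to the weighted convolutions.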

Result of the feature pyramid:

rpn_feature_maps [P2, P3, P4, P5, P6] is fed into the RPN, and mrcnn_feature_maps [P2, P3, P4, P5] is fed into the detection heads.
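
One top-down merge step can be illustrated with plain NumPy. This is only a shape-level sketch: the arrays are constant stand-ins for real feature maps, the 1x1 lateral convolution is assumed to have been applied already, and a 1024x1024 input is assumed so that C4 is 64x64 and P5 is 32x32.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an [H, W, C] array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


channels = 256  # matches TOP_DOWN_PYRAMID_SIZE in the text
c4_lateral = np.ones((64, 64, channels))  # stand-in for the 1x1 conv of C4
p5 = np.full((32, 32, channels), 2.0)     # stand-in for P5

# P4 = upsampled P5 + lateral C4, element-wise.
p4 = upsample2x(p5) + c4_lateral
print(p4.shape)  # (64, 64, 256)
```

The element-wise add is why every lateral 1x1 convolution maps to the same channel count: the two operands must have identical shapes.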

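c. Anchors, the RPN, and proposals

1. In training the anchors are generated from the config parameters; in inference, pre-generated anchors are passed in as an input.

2. build_rpn_model builds the RPN. rpn_feature_maps is fed into it to get the RPN predictions rpn_class_logits, rpn_class, and rpn_bbox.

3. The RPN outputs and the anchors are combined to produce the proposals. Steps: for each image, take the top k anchors by score, refine their coordinates with the RPN regression output rpn_bbox [anchors, (dy, dx, log(dh), log(dw))], then apply non-maximum suppression to output the proposals.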
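
The selection part of step 3 can be sketched in NumPy as follows. This is a simplified greedy NMS written for illustration, not the TensorFlow op the layer actually calls; the function names and the toy boxes are assumptions, and the delta refinement between the two stages is left out.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU of one (y1, x1, y2, x2) box against an [N, 4] array of boxes."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def top_k_then_nms(boxes, scores, pre_nms_limit, proposal_count, nms_threshold):
    """Keep the top-scoring boxes, then greedily suppress heavy overlaps."""
    order = np.argsort(scores)[::-1][:pre_nms_limit]  # top k by score
    keep = []
    while order.size > 0 and len(keep) < proposal_count:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= nms_threshold]  # drop near-duplicates of best
    return keep


boxes = np.array([[0.00, 0.00, 0.5, 0.5],
                  [0.02, 0.02, 0.5, 0.5],   # near-duplicate of box 0
                  [0.50, 0.50, 1.0, 1.0],
                  [0.00, 0.50, 0.4, 0.9]])  # lowest score, cut by top k
scores = np.array([0.9, 0.8, 0.7, 0.1])

keep = top_k_then_nms(boxes, scores, pre_nms_limit=3,
                      proposal_count=2, nms_threshold=0.7)
print(keep)  # [0, 2]
```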

        # Anchors
        if mode == "training":
            anchors = self.get_anchors(config.IMAGE_SHAPE)
            # Duplicate across the batch dimension because Keras requires it
            # TODO: can this be optimized to avoid duplicating the anchors?
            anchors = np.broadcast_to(anchors, (config.BATCH_SIZE,) + anchors.shape)
            # A hack to get around Keras's bad support for constants
            anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
        else:
            anchors = input_anchors


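        # RPN Model: build the RPN with 1 (anchor stride: anchors at every
        # feature-map position), 3 (number of anchor ratios), and 256 (channel
        # depth of the top-down pyramid layers).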
        rpn = build_rpn_model(config.RPN_ANCHOR_STRIDE,
                              len(config.RPN_ANCHOR_RATIOS), config.TOP_DOWN_PYRAMID_SIZE)
        
        # Loop through pyramid layers
        # build_rpn_model only builds the RPN model; the next step feeds each
        # pyramid level P from the top-down pathway into it.
        layer_outputs = []  # list of lists
        for p in rpn_feature_maps:
            layer_outputs.append(rpn([p]))

        # Concatenate layer outputs
        # Convert from list of lists of level outputs to list of lists
        # of outputs across levels.
        # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]]
        output_names = ["rpn_class_logits", "rpn_class", "rpn_bbox"]
        outputs = list(zip(*layer_outputs))
        outputs = [KL.Concatenate(axis=1, name=n)(list(o))
                   for o, n in zip(outputs, output_names)]

        rpn_class_logits, rpn_class, rpn_bbox = outputs

        # Generate proposals
        # Proposals are [batch, N, (y1, x1, y2, x2)] in normalized coordinates
        # and zero padded.
        proposal_count = config.POST_NMS_ROIS_TRAINING if mode == "training"\
            else config.POST_NMS_ROIS_INFERENCE
        rpn_rois = ProposalLayer(
            proposal_count=proposal_count,
            nms_threshold=config.RPN_NMS_THRESHOLD,
            name="ROI",
            config=config)([rpn_class, rpn_bbox, anchors])
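
For a sense of scale, the number of anchors the RPN scores can be estimated from the feature-map sizes. The sketch assumes a 1024x1024 input, backbone strides of (4, 8, 16, 32, 64) for P2 to P6, one anchor scale per level, three ratios, and an anchor stride of 1; treat these as illustrative values rather than a statement about any particular config.

```python
def anchor_count(image_hw, backbone_strides, num_ratios, anchor_stride=1):
    """Total anchors over all pyramid levels (hypothetical helper)."""
    h, w = image_hw
    total = 0
    for stride in backbone_strides:
        fh, fw = h // stride, w // stride  # feature-map size at this level
        rows = len(range(0, fh, anchor_stride))
        cols = len(range(0, fw, anchor_stride))
        total += rows * cols * num_ratios
    return total


n = anchor_count((1024, 1024), (4, 8, 16, 32, 64), num_ratios=3)
print(n)  # 261888
```

Most of these anchors come from P2, which is why the proposal layer trims to a top-k subset before running non-maximum suppression.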

The function the RPN uses to refine the anchors:

def apply_box_deltas_graph(boxes, deltas):
    """Applies the given deltas to the given boxes.
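    boxes: [N, (y1, x1, y2, x2)] boxes to update (here, the top k anchors)
    deltas: [N, (dy, dx, log(dh), log(dw))] refinements to apply (here, the RPN regression output)
    """
    # With o = original box, n = new box, and (y, x) the box centers:
    # dy = (y_n - y_o) / h_o = ((y_n1 + y_n2)/2 - (y_o1 + y_o2)/2) / h_o
    # dx = (x_n - x_o) / w_o = ((x_n1 + x_n2)/2 - (x_o1 + x_o2)/2) / w_o
    # dh = h_n / h_o, stored as log(dh)
    # dw = w_n / w_o, stored as log(dw)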
 
    # Convert to y, x, h, w
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    # Apply deltas
    center_y += deltas[:, 0] * height
    center_x += deltas[:, 1] * width
    height *= tf.exp(deltas[:, 2])
    width *= tf.exp(deltas[:, 3])
    # Convert back to y1, x1, y2, x2
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = tf.stack([y1, x1, y2, x2], axis=1, name="apply_box_deltas_out")
    return result
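
A worked example helps confirm the encoding. The NumPy sketch below computes the deltas that map an anchor onto a target box and then applies them with the same arithmetic as the function above; the function names and the example boxes are chosen for illustration.

```python
import numpy as np


def box_refinement(boxes, gt_boxes):
    """Deltas (dy, dx, log(dh), log(dw)) that turn boxes into gt_boxes."""
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h
    cx = boxes[:, 1] + 0.5 * w
    gh = gt_boxes[:, 2] - gt_boxes[:, 0]
    gw = gt_boxes[:, 3] - gt_boxes[:, 1]
    gcy = gt_boxes[:, 0] + 0.5 * gh
    gcx = gt_boxes[:, 1] + 0.5 * gw
    return np.stack([(gcy - cy) / h, (gcx - cx) / w,
                     np.log(gh / h), np.log(gw / w)], axis=1)


def apply_box_deltas(boxes, deltas):
    """NumPy mirror of apply_box_deltas_graph."""
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h + deltas[:, 0] * h  # shift center by dy * h
    cx = boxes[:, 1] + 0.5 * w + deltas[:, 1] * w  # shift center by dx * w
    h = h * np.exp(deltas[:, 2])                   # scale height by dh
    w = w * np.exp(deltas[:, 3])                   # scale width by dw
    y1, x1 = cy - 0.5 * h, cx - 0.5 * w
    return np.stack([y1, x1, y1 + h, x1 + w], axis=1)


anchor = np.array([[10.0, 10.0, 50.0, 90.0]])  # h=40, w=80, center (30, 50)
target = np.array([[20.0, 30.0, 80.0, 70.0]])  # h=60, w=40, center (50, 50)

deltas = box_refinement(anchor, target)
print(deltas)  # dy=0.5, dx=0.0, log(dh)=log(1.5), log(dw)=log(0.5)
print(apply_box_deltas(anchor, deltas))  # recovers the target box
```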

d. FPN detection heads and mask branch: training and inference

        if mode == "training":
            # Class ID mask to mark class IDs supported by the dataset the image
            # came from.
            # Parse input_image_meta (the model's 2nd input) with parse_image_meta_graph,
            # which returns the image attributes, and take active_class_ids from it.
            active_class_ids = KL.Lambda(
                lambda x: parse_image_meta_graph(x)["active_class_ids"]
                )(input_image_meta)
            

            # Train with either the RPN ROIs or externally generated ROIs.
            # USE_RPN_ROIS is normally True, i.e. target_rois = rpn_rois.
            if not config.USE_RPN_ROIS:
                # Ignore predicted ROIs and use ROIs provided as an input.
                input_rois = KL.Input(shape=[config.POST_NMS_ROIS_TRAINING, 4],
                                      name="input_roi", dtype=np.int32)
                # Normalize coordinates
                target_rois = KL.Lambda(lambda x: norm_boxes_graph(
                    x, K.shape(input_image)[1:3]))(input_rois)
            else:
                target_rois = rpn_rois
            

            # Generate detection targets
            # This converts the input labels into the network's final targets,
            # which are the actual labels used during training.
            # Subsamples proposals and generates target outputs for training
            # Note that proposal class IDs, gt_boxes, and gt_masks are zero
            # padded. Equally, returned rois and targets are zero padded.

            # Inputs: the RPN ROIs, the GT class IDs, the normalized GT boxes, the GT masks
            # Outputs: the ROIs, the target class IDs, the target bounding boxes, the target masks
            rois, target_class_ids, target_bbox, target_mask =\
                DetectionTargetLayer(config, name="proposal_targets")([
                    target_rois, input_gt_class_ids, gt_boxes, input_gt_masks])

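            # The FPN classifier head predicts the class and bounding box of each ROI.
            # Inputs: the ROIs, the feature maps (mrcnn_feature_maps), the image meta,
            # and some config parameters
            # Outputs: the class logits (before softmax), the class probabilities,
            # the predicted bboxes
            # Detail: ROIAlign is applied here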
            # Network Heads
            # TODO: verify that this handles zero padded ROIs
            mrcnn_class_logits, mrcnn_class, mrcnn_bbox =\
                fpn_classifier_graph(rois, mrcnn_feature_maps, input_image_meta,
                                     config.POOL_SIZE, config.NUM_CLASSES,
                                     train_bn=config.TRAIN_BN,
                                     fc_layers_size=config.FPN_CLASSIF_FC_LAYERS_SIZE)
            
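            # Similar, but as a separate branch that produces the masks.
            # ROIAlign is applied here as well.
            # Note that during training the mask branch does not depend on the
            # detection branch: both predict from the same rois.
            # During inference, however, the mask branch relies on the detection
            # results; the difference is the first argument of the function below.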
            mrcnn_mask = build_fpn_mask_graph(rois, mrcnn_feature_maps,
                                              input_image_meta,
                                              config.MASK_POOL_SIZE,
                                              config.NUM_CLASSES,
                                              train_bn=config.TRAIN_BN)

            # TODO: clean up (use tf.identity if necessary)
            # Keras cannot use a raw TF tensor as part of the model's data flow,
            # so it is wrapped in a Keras layer here.
            output_rois = KL.Lambda(lambda x: x * 1, name="output_rois")(rois)

            # Losses
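            # RPN classification loss. Inputs: the match labels and the RPN class logits.
            # Computed with keras sparse_categorical_crossentropy, which takes integer
            # labels rather than one-hot vectors.
            # The match labels take the values {1, 0, -1}: positive, neutral, negative.
            # Neutral anchors (label 0) are excluded from the loss; the remaining
            # labels are mapped from -1/1 to 0/1 before the cross-entropy.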
            rpn_class_loss = KL.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
                [input_rpn_match, rpn_class_logits])

            # RPN bbox loss. Inputs: the labels (bbox and match) and the RPN bbox
            # output [batch, anchors, (dy, dx, log(dh), log(dw))].
            # Uses the smooth L1 loss from Fast R-CNN.
            # Only positive anchors contribute; neutral and negative anchors add no bbox loss.
            rpn_bbox_loss = KL.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")(
                [input_rpn_bbox, input_rpn_match, rpn_bbox])
            
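            # Classification loss. Inputs: the target class IDs derived from the
            # input labels, and the mrcnn class logits.
            # active_class_ids marks the classes that belong to the image's source
            # dataset with 1 and all other classes with 0.
            # Softmax cross-entropy: tf.nn.sparse_softmax_cross_entropy_with_logits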
            class_loss = KL.Lambda(lambda x: mrcnn_class_loss_graph(*x), name="mrcnn_class_loss")(
                [target_class_ids, mrcnn_class_logits, active_class_ids])

            # Like the RPN bbox loss. Inputs: the labels (bbox and class) and the
            # mrcnn bbox predictions. Smooth L1 loss.
            bbox_loss = KL.Lambda(lambda x: mrcnn_bbox_loss_graph(*x), name="mrcnn_bbox_loss")(
                [target_bbox, target_class_ids, mrcnn_bbox])
            
            # Mask loss. Inputs: the mask targets, the target class IDs, and the
            # predicted mrcnn masks [batch, proposals, height, width, num_classes].
            # Binary cross-entropy on sigmoid outputs: keras binary_crossentropy
            mask_loss = KL.Lambda(lambda x: mrcnn_mask_loss_graph(*x), name="mrcnn_mask_loss")(
                [target_mask, target_class_ids, mrcnn_mask])

            # Model
            inputs = [input_image, input_image_meta,
                      input_rpn_match, input_rpn_bbox, input_gt_class_ids, input_gt_boxes, input_gt_masks]
            if not config.USE_RPN_ROIS:
                inputs.append(input_rois)
            outputs = [rpn_class_logits, rpn_class, rpn_bbox,
                       mrcnn_class_logits, mrcnn_class, mrcnn_bbox, mrcnn_mask,
                       rpn_rois, output_rois,
                       rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss]
            model = KM.Model(inputs, outputs, name='mask_rcnn')
        else:
            # Network Heads
            # Proposal classifier and BBox regressor heads
            mrcnn_class_logits, mrcnn_class, mrcnn_bbox =\
                fpn_classifier_graph(rpn_rois, mrcnn_feature_maps, input_image_meta,
                                     config.POOL_SIZE, config.NUM_CLASSES,
                                     train_bn=config.TRAIN_BN,
                                     fc_layers_size=config.FPN_CLASSIF_FC_LAYERS_SIZE)

            # Detections
            # output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in
            # normalized coordinates
            detections = DetectionLayer(config, name="mrcnn_detection")(
                [rpn_rois, mrcnn_class, mrcnn_bbox, input_image_meta])

            # Create masks for detections
            detection_boxes = KL.Lambda(lambda x: x[..., :4])(detections)

            # During inference the mask branch relies on the detection results:
            # its first argument is detection_boxes rather than rois.
            mrcnn_mask = build_fpn_mask_graph(detection_boxes, mrcnn_feature_maps,
                                              input_image_meta,
                                              config.MASK_POOL_SIZE,
                                              config.NUM_CLASSES,
                                              train_bn=config.TRAIN_BN)

            model = KM.Model([input_image, input_image_meta, input_anchors],
                             [detections, mrcnn_class, mrcnn_bbox,
                                 mrcnn_mask, rpn_rois, rpn_class, rpn_bbox],
                             name='mask_rcnn')

        # Add multi-GPU support.
        if config.GPU_COUNT > 1:
            from mrcnn.parallel_model import ParallelModel
            model = ParallelModel(model, config.GPU_COUNT)

        return model
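
The label handling in the RPN classification loss is easy to get wrong, so here is a NumPy sketch for a single image. It assumes match labels of shape [anchors] and logits of shape [anchors, 2] ordered (background, foreground); the function name is hypothetical and this is not the repo's TensorFlow implementation.

```python
import numpy as np


def rpn_class_loss_sketch(rpn_match, rpn_class_logits):
    """Cross-entropy over non-neutral anchors only.

    rpn_match: [anchors], 1 = positive, -1 = negative, 0 = neutral.
    rpn_class_logits: [anchors, 2], unnormalized (BG, FG) scores.
    """
    keep = rpn_match != 0                         # neutral anchors are ignored
    labels = (rpn_match[keep] == 1).astype(int)   # map -1/1 to 0/1
    logits = rpn_class_logits[keep]
    if labels.size == 0:
        return 0.0
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(labels.size), labels].mean())


rpn_match = np.array([1, -1, 0])
logits = np.array([[0.0, 0.0],    # positive anchor, undecided prediction
                   [0.0, 0.0],    # negative anchor, undecided prediction
                   [5.0, -5.0]])  # neutral anchor: must not affect the loss
loss = rpn_class_loss_sketch(rpn_match, logits)
print(loss)  # log(2), about 0.6931
```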

Training model outputs:

1. The three RPN outputs [rpn_class_logits, rpn_class, rpn_bbox]

2. The ROIs produced by the RPN (rpn_rois), and [output_rois], obtained from rpn_rois and the labels

3. The detection and mask results from the two Mask R-CNN branches [mrcnn_class_logits, mrcnn_class, mrcnn_bbox, mrcnn_mask]

4. The five losses [rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss]
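
Two of these five losses, the RPN and the mrcnn bounding-box losses, are described in the code comments as smooth L1. A minimal element-wise version, written in NumPy for illustration:

```python
import numpy as np


def smooth_l1(y_true, y_pred):
    """Quadratic for small errors (|d| < 1), linear for large ones."""
    diff = np.abs(y_true - y_pred)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)


errors = smooth_l1(np.array([0.0, 0.0]), np.array([0.5, 2.0]))
print(errors)  # [0.125 1.5]
```

The linear branch keeps a single badly placed box from dominating the gradient, while the quadratic branch stays smooth near zero.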

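Inference model outputs:

1. The RPN results, including the ROIs it proposes: [rpn_rois, rpn_class, rpn_bbox]

2. [mrcnn_class, mrcnn_bbox], obtained by feeding rpn_rois and the feature maps into the Mask R-CNN detection branch (mrcnn_class_logits is also computed but is not included in the outputs)

3. The final detections [detections], obtained from the RPN ROIs and the detection branch results

4. [mrcnn_mask], obtained from the detections and the Mask R-CNN mask branch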
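
Since the detections tensor is zero padded, downstream code has to drop the padding rows before use. The sketch below does this for one image, assuming the row layout (y1, x1, y2, x2, class_id, score) from the code comment and treating class ID 0 as padding or background; the helper name and sample values are made up.

```python
import numpy as np


def unpack_detections(detections):
    """Split one image's [N, 6] detections and drop zero-padded rows."""
    class_ids = detections[:, 4].astype(np.int32)
    valid = class_ids > 0  # class 0 marks padding / background
    boxes = detections[valid, :4]
    scores = detections[valid, 5]
    return boxes, class_ids[valid], scores


detections = np.array([
    [0.1, 0.1, 0.5, 0.5, 1, 0.9],
    [0.2, 0.2, 0.6, 0.6, 3, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0, 0.0],  # zero padding
])
boxes, class_ids, scores = unpack_detections(detections)
print(boxes.shape, class_ids, scores)
```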