Plain-Language Mask R-CNN and a Code Walkthrough

Mask R-CNN is a detection-and-segmentation network built on Faster R-CNN. It is used mainly for object detection and instance segmentation: it takes the Faster R-CNN framework and adds a mask branch for pixel-level segmentation. Mask R-CNN can also be applied to human pose estimation.
Since Mask R-CNN is based on Faster R-CNN, it helps to review Faster R-CNN first.
A follow-up will cover Swin Transformer and a Mask R-CNN that uses Swin Transformer as its backbone.
The source code read here is Facebook's maskrcnn-benchmark, a PyTorch implementation.
[Figure: Mask R-CNN overall architecture]
1. Mask R-CNN extends Faster R-CNN: an FCN performs segmentation on each of Faster R-CNN's proposal boxes.
2. It introduces RoIAlign to replace Faster R-CNN's RoI Pooling. RoI Pooling is not pixel-to-pixel aligned; this may matter little for bounding boxes, but it strongly affects mask quality. In the paper, RoIAlign improves mask accuracy by a relative 10% to 50%.
3. It adds a mask branch that decouples mask prediction from class prediction: the mask branch only predicts binary masks, while classification and box regression are handled by a separate branch. This differs from the original FCN, which predicts the mask's class at the same time as the mask itself.

1. Backbone + FPN

[Figure: FPN structure]
As shown in the figure above, the left side is the bottom-up pathway, the middle is the top-down pathway, and the right side fuses features from different depths into the final multi-scale feature maps; the whole block forms an encoder-decoder structure.
The C1~C5 layers on the left are ResNet's five stages; each stage downsamples by 2, so the stages sit at [1/2, 1/4, 1/8, 1/16, 1/32] of the input resolution. P5 in the middle is obtained from C5 by a 1x1 convolution, with 256 channels and resolution 32 (for a 1024x1024 input). The P4~P1 levels below it are built top-down: each level 2x-upsamples the level above and adds the 1x1-conv lateral connection from the corresponding C layer, giving resolutions [64, 128, 256, 512], all with 256 channels. The P5~P2 maps on the right are used for the final classification and regression, and P6~P2 are used by the RPN to compute proposals.
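
To make the top-down fusion concrete, here is a minimal sketch (illustrative only: TinyFPN and its channel sizes are assumptions, not the library's FPN class):

import torch
import torch.nn.functional as F
from torch import nn

class TinyFPN(nn.Module):
    """Minimal FPN sketch: 1x1 laterals + top-down 2x upsampling + 3x3 smoothing."""
    def __init__(self, c_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in c_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in c_channels)

    def forward(self, feats):  # feats = [C2, C3, C4, C5], finest first
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: upsample 2x, add the lateral
            laterals[i] = laterals[i] + F.interpolate(laterals[i + 1], scale_factor=2.0)
        outs = [sm(l) for sm, l in zip(self.smooth, laterals)]        # P2..P5
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))  # P6, used by the RPN
        return outs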
The code below shows the main structure; for details and the helper modules, see the source on GitHub:

# Source: maskrcnn-benchmark-main/maskrcnn_benchmark/modeling/backbone/resnet.py
from collections import namedtuple

import torch
import torch.nn.functional as F
from torch import nn

from maskrcnn_benchmark.layers import FrozenBatchNorm2d
from maskrcnn_benchmark.layers import Conv2d
from maskrcnn_benchmark.layers import DFConv2d
from maskrcnn_benchmark.modeling.make_layers import group_norm
from maskrcnn_benchmark.utils.registry import Registry

StageSpec = namedtuple(
    "StageSpec",
    [
        "index",  # Index of the stage, eg 1, 2, ..,. 5
        "block_count",  # Number of residual blocks in the stage
        "return_features",  # True => return the last feature map from this stage
    ],
)
# Using ResNet-50-FPN as the example, only this StageSpec is kept (the full file defines more)
# ResNet-50-FPN (including all stages)
ResNet50FPNStagesTo5 = tuple(
    StageSpec(index=i, block_count=c, return_features=r)
    for (i, c, r) in ((1, 3, True), (2, 4, True), (3, 6, True), (4, 3, True))
)

class ResNet(nn.Module):
    def __init__(self, cfg):
        super(ResNet, self).__init__()

        # Translate string names to implementations
        # StemWithFixedBatchNorm: a conv + BN + ReLU + max-pool stem module
        stem_module = _STEM_MODULES[cfg.MODEL.RESNETS.STEM_FUNC]
        # R-50-FPN: the ResNet-50 + FPN backbone spec
        stage_specs = _STAGE_SPECS[cfg.MODEL.BACKBONE.CONV_BODY]
        # the residual transformation module, i.e. ResNet's Bottleneck block
        transformation_module = _TRANSFORMATION_MODULES[cfg.MODEL.RESNETS.TRANS_FUNC]

        # Construct the stem module (StemWithFixedBatchNorm here)
        self.stem = stem_module(cfg)

        # Construct the specified ResNet stages
        num_groups = cfg.MODEL.RESNETS.NUM_GROUPS  # default 1
        width_per_group = cfg.MODEL.RESNETS.WIDTH_PER_GROUP  # default 64; 1 * 64
        in_channels = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS  # stem output channels, default 64
        stage2_bottleneck_channels = num_groups * width_per_group  # stage-2 bottleneck channels
        stage2_out_channels = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS  # stage-2 output channels, default 256; later stages double it
        self.stages = []  # names of the ResNet stages (layer1..layer4, i.e. C2~C5)
        self.return_features = {}  # whether each stage's feature map is returned to the FPN
        # stage_specs is the ResNet50FPNStagesTo5 tuple defined above
        for stage_spec in stage_specs:
            name = "layer" + str(stage_spec.index)  # name of this stage
            stage2_relative_factor = 2 ** (stage_spec.index - 1)  # channel growth factor relative to stage 2
            bottleneck_channels = stage2_bottleneck_channels * stage2_relative_factor  # inner bottleneck channels of this stage
            out_channels = stage2_out_channels * stage2_relative_factor  # output channels of this stage
            stage_with_dcn = cfg.MODEL.RESNETS.STAGE_WITH_DCN[stage_spec.index - 1]  # whether to use deformable convolution; default False
            # see _make_stage below
            module = _make_stage(
                transformation_module,  # the Bottleneck block; the remaining arguments are explained at _make_stage
                in_channels,
                bottleneck_channels,
                out_channels,
                stage_spec.block_count,
                num_groups,
                cfg.MODEL.RESNETS.STRIDE_IN_1X1,
                first_stride=int(stage_spec.index > 1) + 1,  # stride 1 for stage 1, stride 2 for later stages
                dcn_config={
                    "stage_with_dcn": stage_with_dcn,
                    "with_modulated_dcn": cfg.MODEL.RESNETS.WITH_MODULATED_DCN,
                    "deformable_groups": cfg.MODEL.RESNETS.DEFORMABLE_GROUPS,
                }  # DCN is not used here, so it is not covered further
            )
            # the output of this stage becomes the input of the next
            in_channels = out_channels
            self.add_module(name, module)  # register as a submodule for forward/backward
            self.stages.append(name)
            self.return_features[name] = stage_spec.return_features

        # freeze the early layers according to FREEZE_CONV_BODY_AT
        self._freeze_backbone(cfg.MODEL.BACKBONE.FREEZE_CONV_BODY_AT)

    def forward(self, x):
        outputs = []
        x = self.stem(x)
        for stage_name in self.stages:
            x = getattr(self, stage_name)(x)
            if self.return_features[stage_name]:
                outputs.append(x)
        return outputs
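
# For ResNet-50 the loop above yields layer1..layer4 with output channels
# 256, 512, 1024, 2048 (stage2_out_channels doubled at each stage). A quick
# illustrative check of that progression:
#
#     stage2_out_channels = 256
#     for index in (1, 2, 3, 4):
#         print(f"layer{index}: {stage2_out_channels * 2 ** (index - 1)} channels")
#     # layer1: 256, layer2: 512, layer3: 1024, layer4: 2048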

def _make_stage(
    transformation_module,
    in_channels,
    bottleneck_channels,
    out_channels,
    block_count,
    num_groups,
    stride_in_1x1,
    first_stride,
    dilation=1,
    dcn_config={}
):
    blocks = []
    stride = first_stride
    # block_count: the number of Bottleneck blocks per stage, i.e. 3, 4, 6, 3 for ResNet-50
    for _ in range(block_count):
        blocks.append(
            # each block is a Bottleneck; see the Bottleneck class below
            transformation_module(
                in_channels,  # input channels
                bottleneck_channels,  # the reduced inner channels of the bottleneck; see ResNet for details
                out_channels,  # final output channels of the block
                num_groups,  # number of convolution groups
                stride_in_1x1,  # whether the stride is applied in the 1x1 conv
                stride,
                dilation=dilation,  # dilation factor, used mostly in segmentation tasks
                dcn_config=dcn_config
            )
        )
        stride = 1
        in_channels = out_channels
    return nn.Sequential(*blocks)


class Bottleneck(nn.Module):
    def __init__(
        self,
        in_channels,
        bottleneck_channels,
        out_channels,
        num_groups,
        stride_in_1x1,
        stride,
        dilation,
        norm_func,
        dcn_config
    ):
        super(Bottleneck, self).__init__()

        self.downsample = None
        if in_channels != out_channels:  # if input and output channels differ
            down_stride = stride if dilation == 1 else 1
            # the shortcut then needs a 1x1 projection so the residual addition matches shapes
            self.downsample = nn.Sequential(
                Conv2d(
                    in_channels, out_channels,
                    kernel_size=1, stride=down_stride, bias=False
                ),
                norm_func(out_channels),
            )
            # initialize the downsample weights
            for modules in [self.downsample,]:
                for l in modules.modules():
                    if isinstance(l, Conv2d):
                        nn.init.kaiming_uniform_(l.weight, a=1)

        if dilation > 1:
            stride = 1 # reset to be 1

        # The original MSRA ResNet models have stride in the first 1x1 conv
        # The subsequent fb.torch.resnet and Caffe2 ResNe[X]t implementations have
        # stride in the 3x3 conv
        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)
        # the first conv of each Bottleneck: a 1x1 conv that reduces channels (and may carry the stride)
        self.conv1 = Conv2d(
            in_channels,
            bottleneck_channels,
            kernel_size=1,
            stride=stride_1x1,
            bias=False,
        )
        self.bn1 = norm_func(bottleneck_channels)
        # TODO: specify init for the above
        with_dcn = dcn_config.get("stage_with_dcn", False)
        if with_dcn:
            deformable_groups = dcn_config.get("deformable_groups", 1)
            with_modulated_dcn = dcn_config.get("with_modulated_dcn", False)
            self.conv2 = DFConv2d(
                bottleneck_channels,
                bottleneck_channels,
                with_modulated_dcn=with_modulated_dcn,
                kernel_size=3,
                stride=stride_3x3,
                groups=num_groups,
                dilation=dilation,
                deformable_groups=deformable_groups,
                bias=False
            )
        else:
            # the second conv of each Bottleneck: a 3x3 conv extracting features at constant width
            self.conv2 = Conv2d(
                bottleneck_channels,
                bottleneck_channels,
                kernel_size=3,
                stride=stride_3x3,
                padding=dilation,
                bias=False,
                groups=num_groups,
                dilation=dilation
            )
            nn.init.kaiming_uniform_(self.conv2.weight, a=1)

        self.bn2 = norm_func(bottleneck_channels)
        # the third conv of each Bottleneck: a 1x1 conv that expands channels back out
        self.conv3 = Conv2d(
            bottleneck_channels, out_channels, kernel_size=1, bias=False
        )
        self.bn3 = norm_func(out_channels)

        for l in [self.conv1, self.conv3,]:
            nn.init.kaiming_uniform_(l.weight, a=1)

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu_(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu_(out)

        out = self.conv3(out)
        out = self.bn3(out)
        # if a projection shortcut was built, apply it to the identity branch
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity 
        out = F.relu_(out)

        return out
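
# Quick shape check of the block above (a sketch: nn.BatchNorm2d stands in for
# the library's FrozenBatchNorm2d, which the BottleneckWithFixedBatchNorm
# subclass normally supplies as norm_func):
#
#     block = Bottleneck(in_channels=64, bottleneck_channels=64, out_channels=256,
#                        num_groups=1, stride_in_1x1=True, stride=1, dilation=1,
#                        norm_func=nn.BatchNorm2d, dcn_config={})
#     x = torch.randn(1, 64, 56, 56)
#     print(block(x).shape)  # torch.Size([1, 256, 56, 56])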

# The registries below map config strings to implementations; the
# *WithFixedBatchNorm / *WithGN classes and the remaining StageSpec
# tuples are defined in the full source file.
_TRANSFORMATION_MODULES = Registry({
    "BottleneckWithFixedBatchNorm": BottleneckWithFixedBatchNorm,
    "BottleneckWithGN": BottleneckWithGN,
})

_STEM_MODULES = Registry({
    "StemWithFixedBatchNorm": StemWithFixedBatchNorm,
    "StemWithGN": StemWithGN,
})

_STAGE_SPECS = Registry({
    "R-50-C4": ResNet50StagesTo4,
    "R-50-C5": ResNet50StagesTo5,
    "R-101-C4": ResNet101StagesTo4,
    "R-101-C5": ResNet101StagesTo5,
    "R-50-FPN": ResNet50FPNStagesTo5,
    "R-50-FPN-RETINANET": ResNet50FPNStagesTo5,
    "R-101-FPN": ResNet101FPNStagesTo5,
    "R-101-FPN-RETINANET": ResNet101FPNStagesTo5,
    "R-152-FPN": ResNet152FPNStagesTo5,
})

2. RPN

The diagram below is borrowed from another author's post (link):
[Figure: RPN workflow]
The anchors block on the left is the anchor-generation flow, the middle builds the RPN network itself, and the ProposalLayer on the right filters RoIs to produce the proposals.
First, the overall RPN module:

class RPNModule(torch.nn.Module):

    def __init__(self, cfg, in_channels):
        super(RPNModule, self).__init__()

        self.cfg = cfg.clone()
        # anchor generator
        anchor_generator = make_anchor_generator(cfg)
        # the head module, registered under the name RPNHead
        rpn_head = registry.RPN_HEADS[cfg.MODEL.RPN.RPN_HEAD]
        head = rpn_head(
            cfg, in_channels, anchor_generator.num_anchors_per_location()[0]
        )

        rpn_box_coder = BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))
        # used in training: post-processes the boxes computed by the RPN
        box_selector_train = make_rpn_postprocessor(cfg, rpn_box_coder, is_train=True)
        # used at test time
        box_selector_test = make_rpn_postprocessor(cfg, rpn_box_coder, is_train=False)
        # used in training: computes the proposal (RPN) losses
        loss_evaluator = make_rpn_loss_evaluator(cfg, rpn_box_coder)

        self.anchor_generator = anchor_generator
        self.head = head
        self.box_selector_train = box_selector_train
        self.box_selector_test = box_selector_test
        self.loss_evaluator = loss_evaluator

    def forward(self, images, features, targets=None):
        
        objectness, rpn_box_regression = self.head(features)
        anchors = self.anchor_generator(images, features)

        if self.training:
            return self._forward_train(anchors, objectness, rpn_box_regression, targets)
        else:
            return self._forward_test(anchors, objectness, rpn_box_regression)

    def _forward_train(self, anchors, objectness, rpn_box_regression, targets):
        if self.cfg.MODEL.RPN_ONLY:
            boxes = anchors
        else:
            # For end-to-end models, anchors must be transformed into boxes and
            # sampled into a training batch.
            with torch.no_grad():
                # the boxes kept after NMS, with the regression deltas applied
                boxes = self.box_selector_train(
                    anchors, objectness, rpn_box_regression, targets
                )
        loss_objectness, loss_rpn_box_reg = self.loss_evaluator(
            anchors, objectness, rpn_box_regression, targets
        )
        losses = {
            "loss_objectness": loss_objectness,
            "loss_rpn_box_reg": loss_rpn_box_reg,
        }
        return boxes, losses
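
    def _forward_test(self, anchors, objectness, rpn_box_regression):
        # called by forward above at inference time (from the same source file)
        boxes = self.box_selector_test(anchors, objectness, rpn_box_regression)
        if self.cfg.MODEL.RPN_ONLY:
            # for RPN-only models the proposals are the final output, so they
            # are returned sorted by decreasing objectness score
            inds = [
                box.get_field("objectness").sort(descending=True)[1] for box in boxes
            ]
            boxes = [box[ind] for box, ind in zip(boxes, inds)]
        return boxes, {}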

class RPNHead(nn.Module):
    """
    Adds a simple RPN Head with classification and regression heads
    """

    def __init__(self, cfg, in_channels, num_anchors):
        """
        Arguments:
            cfg              : config
            in_channels (int): number of channels of the input feature
            num_anchors (int): number of anchors to be predicted
        """
        super(RPNHead, self).__init__()
        # a 3x3 conv for feature extraction
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        # a 1x1 conv for classification with num_anchors outputs: a binary score per anchor (background vs. object)
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        # a 1x1 conv for regression with 4 * num_anchors outputs: the box deltas of each anchor
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )
        # initialize the weights
        for l in [self.conv, self.cls_logits, self.bbox_pred]:
            torch.nn.init.normal_(l.weight, std=0.01)
            torch.nn.init.constant_(l.bias, 0)

    def forward(self, x):
        logits = []
        bbox_reg = []
        for feature in x:  # loop over each FPN feature level
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
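
A quick shape check of the head: with 256-channel FPN features and 3 anchors per location, each level yields one objectness score and four box deltas per anchor per position. A sketch (the feature size is made up; cfg is unused by the excerpted __init__, so None suffices):

head = RPNHead(cfg=None, in_channels=256, num_anchors=3)
p2 = torch.randn(1, 256, 200, 336)  # a made-up P2-sized feature map
logits, bbox_reg = head([p2])
print(logits[0].shape)    # torch.Size([1, 3, 200, 336])
print(bbox_reg[0].shape)  # torch.Size([1, 12, 200, 336])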

Next, anchor generation. The feature maps used here are the rpn_feature_maps produced by the backbone in part 1, five levels [P2, P3, P4, P5, P6]; anchors are generated on each resolution according to the config parameters.

def make_anchor_generator(config):
    anchor_sizes = config.MODEL.RPN.ANCHOR_SIZES  # (32, 64, 128, 256, 512): anchor sizes
    aspect_ratios = config.MODEL.RPN.ASPECT_RATIOS  # (0.5, 1.0, 2.0): anchor aspect ratios
    anchor_stride = config.MODEL.RPN.ANCHOR_STRIDE  # (4, 8, 16, 32, 64) for FPN: the downsampling factor of each level, i.e. the spacing of anchor centers in the original image
    straddle_thresh = config.MODEL.RPN.STRADDLE_THRESH  # 0
    # sanity checks
    if config.MODEL.RPN.USE_FPN:
        assert len(anchor_stride) == len(
            anchor_sizes
        ), "FPN should have len(ANCHOR_STRIDE) == len(ANCHOR_SIZES)"
    else:
        assert len(anchor_stride) == 1, "Non-FPN should have a single ANCHOR_STRIDE"
    # the class that generates the anchors; see below
    anchor_generator = AnchorGenerator(
        anchor_sizes, aspect_ratios, anchor_stride, straddle_thresh
    )
    return anchor_generator
# the anchor-generating class
class AnchorGenerator(nn.Module):

    def __init__(
        self,
        sizes=(128, 256, 512),
        aspect_ratios=(0.5, 1.0, 2.0),
        anchor_strides=(8, 16, 32),
        straddle_thresh=0,
    ):
        super(AnchorGenerator, self).__init__()

        if len(anchor_strides) == 1:
            anchor_stride = anchor_strides[0]
            cell_anchors = [
                # the function that generates the base anchors
                generate_anchors(anchor_stride, sizes, aspect_ratios).float()
            ]
        else:
            if len(anchor_strides) != len(sizes):
                raise RuntimeError("FPN should have #anchor_strides == #sizes")

            cell_anchors = [
                generate_anchors(
                    anchor_stride,
                    size if isinstance(size, (tuple, list)) else (size,),
                    aspect_ratios
                ).float()
                for anchor_stride, size in zip(anchor_strides, sizes)
            ]
        self.strides = anchor_strides
        self.cell_anchors = BufferList(cell_anchors)
        self.straddle_thresh = straddle_thresh

    def num_anchors_per_location(self):
        return [len(cell_anchors) for cell_anchors in self.cell_anchors]

    def grid_anchors(self, grid_sizes):
        anchors = []
        for size, stride, base_anchors in zip(
            grid_sizes, self.strides, self.cell_anchors
        ):
            grid_height, grid_width = size
            device = base_anchors.device
            shifts_x = torch.arange(
                0, grid_width * stride, step=stride, dtype=torch.float32, device=device
            )
            shifts_y = torch.arange(
                0, grid_height * stride, step=stride, dtype=torch.float32, device=device
            )
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)
            shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)

            anchors.append(
                (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)
            )

        return anchors

    def add_visibility_to(self, boxlist):
        image_width, image_height = boxlist.size
        anchors = boxlist.bbox
        if self.straddle_thresh >= 0:
            inds_inside = (
                (anchors[..., 0] >= -self.straddle_thresh)
                & (anchors[..., 1] >= -self.straddle_thresh)
                & (anchors[..., 2] < image_width + self.straddle_thresh)
                & (anchors[..., 3] < image_height + self.straddle_thresh)
            )
        else:
            device = anchors.device
            inds_inside = torch.ones(anchors.shape[0], dtype=torch.bool, device=device)
        boxlist.add_field("visibility", inds_inside)

    def forward(self, image_list, feature_maps):
        grid_sizes = [feature_map.shape[-2:] for feature_map in feature_maps]
        anchors_over_all_feature_maps = self.grid_anchors(grid_sizes)
        anchors = []
        for i, (image_height, image_width) in enumerate(image_list.image_sizes):
            anchors_in_image = []
            for anchors_per_feature_map in anchors_over_all_feature_maps:
                boxlist = BoxList(
                    anchors_per_feature_map, (image_width, image_height), mode="xyxy"
                )
                self.add_visibility_to(boxlist)
                anchors_in_image.append(boxlist)
            anchors.append(anchors_in_image)
        return anchors
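
# grid_anchors slides each level's base anchors across its feature grid by
# broadcasting: shifts of shape [H*W, 1, 4] plus base anchors of shape
# [1, A, 4] give [H*W*A, 4]. A toy check (2x2 grid, stride 4, one base anchor):
#
#     base = torch.tensor([[-8., -8., 8., 8.]])
#     steps = torch.arange(0, 2 * 4, step=4, dtype=torch.float32)
#     sy, sx = torch.meshgrid(steps, steps)
#     shifts = torch.stack((sx.reshape(-1), sy.reshape(-1),
#                           sx.reshape(-1), sy.reshape(-1)), dim=1)
#     print((shifts.view(-1, 1, 4) + base.view(1, -1, 4)).reshape(-1, 4))
#     # four anchors centered at (0, 0), (4, 0), (0, 4), (4, 4)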
def generate_anchors(
    stride=16, sizes=(32, 64, 128, 256, 512), aspect_ratios=(0.5, 1, 2)
):
    """生成(x1, y1, x2, y2)格式的锚框矩阵。锚点以步幅/ 2为中心,具有指定大小的(近似)根号面积和给定的纵横比。
    """
    return _generate_anchors(
        stride,
        np.array(sizes, dtype=float) / stride,  # np.float was removed in NumPy >= 1.24; plain float works
        np.array(aspect_ratios, dtype=float),
    )


def _generate_anchors(base_size, scales, aspect_ratios):
    """Generate anchor (reference) windows by enumerating aspect ratios X
    scales wrt a reference (0, 0, base_size - 1, base_size - 1) window.
    """
    anchor = np.array([1, 1, base_size, base_size], dtype=float) - 1
    anchors = _ratio_enum(anchor, aspect_ratios)  # enumerate anchors of each aspect ratio around the center
    anchors = np.vstack(
        [_scale_enum(anchors[i, :], scales) for i in range(anchors.shape[0])]
    )
    return torch.from_numpy(anchors)

RPN_ANCHOR_SCALES are the anchor sizes, (32, 64, 128, 256, 512), matched one-to-one with the rpn_feature_maps [P2, P3, P4, P5, P6], whose resolutions are [256, 128, 64, 32, 16]: high-resolution bottom levels detect smaller objects, while low-resolution top levels detect larger ones. The resulting anchors have shape [anchor_count, (y1, x1, y2, x2)], where anchor_count = (256*256 + 128*128 + 64*64 + 32*32 + 16*16) * 3 = 261888. The subsequent proposal layer filters these down.
The non-maximum suppression that yields the final proposals is implemented in the make_rpn_postprocessor function.
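
That anchor count can be verified directly (a quick sketch using the resolutions listed above):

resolutions = [256, 128, 64, 32, 16]  # P2..P6 feature map sizes
num_ratios = 3  # aspect ratios (0.5, 1.0, 2.0)
print(sum(r * r for r in resolutions) * num_ratios)  # 261888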

3. RoI Heads

Borrowing the same author's diagram again:
[Figure: RoI heads: box head and mask head]

# the detection (box) head
class ROIBoxHead(torch.nn.Module):
    """
    Generic Box Head class.
    """

    def __init__(self, cfg, in_channels):
        super(ROIBoxHead, self).__init__()
        self.cfg = cfg
        # build an RoI feature extractor
        self.feature_extractor = make_roi_box_feature_extractor(cfg, in_channels)
        # the final predictor
        self.predictor = make_roi_box_predictor(
            cfg, self.feature_extractor.out_channels)
        # post-processing: NMS plus box decoding
        self.post_processor = make_roi_box_post_processor(cfg)
        # the final loss computation
        self.loss_evaluator = make_roi_box_loss_evaluator(cfg)

    def forward(self, features, proposals, targets=None):
       
        if self.training:
            # Faster R-CNN subsamples during training the proposals with a fixed
            # positive / negative ratio
            with torch.no_grad():
                # subsample the RPN proposals against the targets to form the new training proposals
                proposals = self.loss_evaluator.subsample(proposals, targets)

        # extract features that will be fed to the final classifier. The
        # feature_extractor generally corresponds to the pooler + heads
        x = self.feature_extractor(features, proposals)  # extract per-RoI features for the predictor
        # final classifier that converts the features into predictions
        class_logits, box_regression = self.predictor(x)  # classification and box regression

        if not self.training:
            result = self.post_processor((class_logits, box_regression), proposals)
            return x, result, {}

        loss_classifier, loss_box_reg = self.loss_evaluator(
            [class_logits], [box_regression]
        )
        return (
            x,
            proposals,
            dict(loss_classifier=loss_classifier, loss_box_reg=loss_box_reg),
        )

Below is the segmentation (mask) head:

# Largely the same as the box head above; the main difference is that classification is per pixel rather than per object box
class ROIMaskHead(torch.nn.Module):
    def __init__(self, cfg, in_channels):
        super(ROIMaskHead, self).__init__()
        self.cfg = cfg.clone()
        self.feature_extractor = make_roi_mask_feature_extractor(cfg, in_channels)
        self.predictor = make_roi_mask_predictor(
            cfg, self.feature_extractor.out_channels)
        self.post_processor = make_roi_mask_post_processor(cfg)
        self.loss_evaluator = make_roi_mask_loss_evaluator(cfg)

    def forward(self, features, proposals, targets=None):
       
        if self.training:
            # during training, only focus on positive boxes
            all_proposals = proposals
            proposals, positive_inds = keep_only_positive_boxes(proposals)
        if self.training and self.cfg.MODEL.ROI_MASK_HEAD.SHARE_BOX_FEATURE_EXTRACTOR:
            x = features
            x = x[torch.cat(positive_inds, dim=0)]
        else:
            x = self.feature_extractor(features, proposals)
        mask_logits = self.predictor(x)

        if not self.training:
            result = self.post_processor(mask_logits, proposals)
            return x, result, {}

        loss_mask = self.loss_evaluator(proposals, mask_logits, targets)

        return x, all_proposals, dict(loss_mask=loss_mask)
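
The decoupling of mask and class prediction mentioned at the start shows up in the loss: the mask head outputs one binary mask per class, and only the ground-truth class's channel is penalized. A minimal sketch of that idea (per_class_mask_loss is a hypothetical helper, not the library's make_roi_mask_loss_evaluator internals):

import torch
import torch.nn.functional as F

def per_class_mask_loss(mask_logits, gt_classes, gt_masks):
    # mask_logits: [N, num_classes, 28, 28]; gt_classes: [N]; gt_masks: [N, 28, 28] in {0, 1}
    idx = torch.arange(mask_logits.shape[0], device=mask_logits.device)
    selected = mask_logits[idx, gt_classes]  # keep only the GT-class channel per RoI
    return F.binary_cross_entropy_with_logits(selected, gt_masks)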

Finally, a look at make_roi_box_feature_extractor. This is where roi_align comes in: the Pooler class extracts fixed-size features for each proposal.

class FPN2MLPFeatureExtractor(nn.Module):
    def __init__(self, cfg, in_channels):
        super(FPN2MLPFeatureExtractor, self).__init__()

        resolution = cfg.MODEL.ROI_BOX_HEAD.POOLER_RESOLUTION
        scales = cfg.MODEL.ROI_BOX_HEAD.POOLER_SCALES
        sampling_ratio = cfg.MODEL.ROI_BOX_HEAD.POOLER_SAMPLING_RATIO
        # see the Pooler class in the source for the details
        pooler = Pooler(
            output_size=(resolution, resolution),
            scales=scales,
            sampling_ratio=sampling_ratio,
        )
        input_size = in_channels * resolution ** 2
        representation_size = cfg.MODEL.ROI_BOX_HEAD.MLP_HEAD_DIM
        use_gn = cfg.MODEL.ROI_BOX_HEAD.USE_GN
        self.pooler = pooler
        self.fc6 = make_fc(input_size, representation_size, use_gn)
        self.fc7 = make_fc(representation_size, representation_size, use_gn)
        self.out_channels = representation_size

    def forward(self, x, proposals):
        x = self.pooler(x, proposals)  # RoIAlign feature extraction
        x = x.view(x.size(0), -1)  # flatten each RoI's features into a vector

        x = F.relu(self.fc6(x))  # two FC layers produce the representation
        x = F.relu(self.fc7(x))  # that the predictor uses for cls + reg

        return x
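
The Pooler wraps the RoIAlign operator; torchvision ships the same op, so one pooling call can be sketched standalone (the box coordinates are made up, and spatial_scale is the inverse feature stride):

import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)  # one feature level, stride 16 assumed
rois = torch.tensor([[0., 48.3, 32.6, 112.9, 96.1]])  # (batch_idx, x1, y1, x2, y2) in image coordinates
pooled = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])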

The advantage of RoIAlign over RoI Pooling is that RoIAlign never rounds: everything stays in floating point. The steps are as follows, with a worked example after the list:

  1. Compute the RoI's side lengths on the feature map, without rounding;
  2. Divide the RoI evenly into k x k bins, without rounding the bin sizes;
  3. Compute each sampling point inside a bin by bilinear interpolation from the four nearest feature-map values;
  4. Apply max pooling or average pooling to obtain the fixed-length feature vector.

For example, feed an 800x800 image through a conv network with five 2x downsamplings, giving a 25x25 feature map. If an RoI covers 600x500 in the image, its projection onto the feature map is 600/32 x 500/32 = 18.75 x 15.625. Since this does not divide evenly, RoI Pooling rounds down, making the RoI's feature map 18 x 15: the first misalignment. RoI Pooling next divides the feature map into bins; say we need 7 x 7 bins, so each bin would be 18/7 x 15/7. Again this does not divide evenly, so RoI Pooling rounds down once more: each bin becomes 2 x 2, and the effective RoI feature map is 14 x 14, producing the second misalignment. Compared with the feature map before RoI Pooling, this introduces errors of 4.75 and 1.625 (feature-map pixels) horizontally and vertically. For classification or detection a shift of a few pixels may matter little (although box regression does benefit from the finer computation), but segmentation must be accurate to the pixel, which is why RoI Pooling cannot be used in Mask R-CNN.
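
The two rounding steps in this example are easy to reproduce; the numbers below match the walkthrough above:

roi_w, roi_h, stride, bins = 600, 500, 32, 7
fm_w, fm_h = roi_w // stride, roi_h // stride  # RoI Pooling floors: 18, 15
bin_w, bin_h = fm_w // bins, fm_h // bins      # floors again: 2, 2
print(bin_w * bins, bin_h * bins)              # effective map: 14 14
print(roi_w / stride, roi_h / stride)          # RoIAlign keeps 18.75 15.625
print(roi_w / stride - bin_w * bins, roi_h / stride - bin_h * bins)  # errors: 4.75 1.625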

Reference: Mask R-CNN讲解 - 江南綿雨 (CSDN blog)

Reference: BINGO Hong: MASK_RCNN代码详解(4) - Losses部分
