Creating a Mask R-CNN model in PyTorch


Preface

Most of the example code shown here is a simplified, illustrative version of the PyTorch source: code that exists only for generality has been left out. It is meant to explain the logic, but it still runs correctly.
Code link: a custom project that wraps the torch version of Mask R-CNN

0. Model input/output parameters: see

Link: explanation of the PyTorch Mask R-CNN model parameters
Core code
GeneralizedRCNN (explained here using Mask R-CNN)

# Get the FPN feature maps from the input images; note that the backbone here is not the raw ResNet but its FPN-wrapped version
features = self.backbone(images.tensors)
# For Mask R-CNN this is already a dict; this branch wraps a plain (non-FPN) tensor into a dict
if isinstance(features, torch.Tensor):
    features = OrderedDict([("0", features)])
# The RPN generates candidate regions (proposals) from the extracted features, the input images and, if given (e.g. during training), the targets. It may also return the related losses (proposal_losses).
proposals, proposal_losses = self.rpn(images, features, targets)
# The ROI Heads (Region of Interest Heads) classify the proposals and regress their boxes, using the RPN proposals, the features, the image sizes (images.image_sizes) and the targets. They return the detections and the classification/regression losses (detector_losses).
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
# Post-processing converts the detections into a more convenient format, e.g. mapping box coordinates back to the original image's pixel coordinates, or thresholding scores to drop low-confidence detections.
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)  # type: ignore[operator]
# Collect the losses
losses = {}
losses.update(detector_losses)
losses.update(proposal_losses)
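For reference, the postprocess call above mainly rescales the predicted boxes from the resized image back to the original image size. A minimal sketch of that rescaling (the helper name resize_boxes_back is illustrative; boxes assumed to be in (x1, y1, x2, y2) format):

import torch

def resize_boxes_back(boxes: torch.Tensor, resized_size, original_size) -> torch.Tensor:
    # boxes: Tensor(N, 4) in (x1, y1, x2, y2) on the resized image
    ratio_h = original_size[0] / resized_size[0]
    ratio_w = original_size[1] / resized_size[1]
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    # scale x coordinates by the width ratio, y coordinates by the height ratio
    return torch.stack([x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h], dim=1)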

Core model classes:

MaskRCNN(FasterRCNN)
FasterRCNN(GeneralizedRCNN)
GeneralizedRCNN(nn.Module)
# Preprocesses the raw images
GeneralizedRCNNTransform(nn.Module)
BackboneWithFPN(nn.Module)
IntermediateLayerGetter(nn.ModuleDict)
FeaturePyramidNetwork(nn.Module)
# The RPN network
RegionProposalNetwork(nn.Module)
# Builds the anchors
AnchorGenerator(nn.Module)
# The RPN head
RPNHead(nn.Module)

1. Extracting the feature maps

Extract feature maps from the input images with a backbone network (e.g. ResNet).

1.1 Applying the transform

Apply the transform to the input images and targets; this is mainly normalization and resizing, followed by merging into one tensor.

1.1.1 Normalizing the images

If ImageNet pretrained weights are used, normalization is usually done with the parameters below. Its main purpose is to suppress effects such as color temperature and exposure.
Note that the images passed into the model are float tensors with values in [0, 1].

image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]
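A minimal sketch of this normalization, assuming the image is a float tensor in [0, 1] with shape (3, H, W):

import torch

image_mean = torch.tensor([0.485, 0.456, 0.406])
image_std = torch.tensor([0.229, 0.224, 0.225])

def normalize(image: torch.Tensor) -> torch.Tensor:
    # subtract the per-channel mean and divide by the per-channel std
    return (image - image_mean[:, None, None]) / image_std[:, None, None]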

1.1.2 Resizing the images

Images in the same batch need to be scaled to suitable sizes so that they can be merged into one tensor later. Note that they are not resized to a single uniform size;
they are only constrained to lie between min_size and max_size.
When defining the dataloader you can use a sampler that groups images of similar aspect ratio into the same batch to save computation.
The model requires a max_size and a min_size; images that are too large or too small are rescaled to a suitable size.
The resize can use bilinear interpolation (this is what the PyTorch model does) or plain padding.

With min_size=800 and max_size=1333, example data looks like this:
input image shapes [[3,312, 632], [3,490, 564], [3,884, 1494], [3,658, 1008]]
output image shapes [[3,658, 1333], [3,800, 920], [3,788, 1333], [3,800, 1225]]
Images that are too small are enlarged and images that are too large are shrunk with bilinear interpolation; masks are treated the same way (with nearest-neighbor interpolation).
The scale factor depends on whichever of width/height is closer to violating the max/min limits, i.e. the aspect ratio is preserved and the longer side ends up between 800 and 1333.
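A minimal sketch of this resize rule, assuming min_size=800, max_size=1333 and bilinear interpolation (the helper name is illustrative):

import torch
from torch.nn import functional as F

def resize_image(image: torch.Tensor, min_size: float = 800, max_size: float = 1333) -> torch.Tensor:
    # image: Tensor(3, H, W). Keep the aspect ratio: scale the short side towards min_size,
    # but never let the long side exceed max_size.
    h, w = image.shape[-2:]
    scale = min(min_size / min(h, w), max_size / max(h, w))
    return F.interpolate(image[None], scale_factor=scale, mode="bilinear",
                         align_corners=False, recompute_scale_factor=True)[0]

For example, a (3, 312, 632) image gets scale = min(800/312, 1333/632) ≈ 2.11 and comes out as roughly (3, 658, 1333), matching the table above.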

Note: the post-resize sizes must be recorded, because later steps need the "original" image size, which from here on means the resized size.

1.1.3 Resizing the masks in targets

Since the images were resized, the masks must be resized in the same way, otherwise they no longer line up. The method matches the one used for the images.

1.1.4 Merging the images into one tensor

Merge list(tensor([N,H,W]))[B] into tensor[B,N,H,W] so it can be fed into the backbone.

Merge the normalized and resized images from a list into a single tensor: list(4) -> tensor(4,3,h,w).
Note that the input heights/widths may differ, so the batch has to accommodate the largest size. Below is an example of input sizes with batch=4:
[[3, 800, 920],[3, 658, 1333],[3, 788, 1333],[3, 800, 1225]]
The output then has a shape like Tensor(4,3,800,1344).
It is 1344 rather than 1333 because of the size_divisible parameter, which pads sizes up to a multiple of 32; this makes later computation more convenient.
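A minimal sketch of this batching step, assuming size_divisible=32 and zero padding at the bottom/right (which is what torchvision does):

import math
import torch

def batch_images(images, size_divisible: int = 32) -> torch.Tensor:
    # images: list of Tensor(3, H, W) with different H/W
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    # round the maximum size up to a multiple of size_divisible (e.g. 1333 -> 1344)
    max_h = int(math.ceil(max_h / size_divisible) * size_divisible)
    max_w = int(math.ceil(max_w / size_divisible) * size_divisible)
    batched = images[0].new_zeros((len(images), 3, max_h, max_w))
    for img, padded in zip(images, batched):
        # copy each image into the top-left corner; the rest stays zero
        padded[:, : img.shape[1], : img.shape[2]].copy_(img)
    return batched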

1.2 Creating the backbone and backbone_with_fpn

1.2.1 Building the FPN from the backbone network

For example resnet50: we do not need its classification output, but the earlier multi-level feature maps, which are then combined into the FPN data.
Call stack:
GeneralizedRCNN.forward ->
BackboneWithFPN.forward

backbone = resnet50(weights=weights_backbone, progress=progress, norm_layer=norm_layer)
backbone = _resnet_fpn_extractor(backbone, trainable_backbone_layers)

The output has the following structure (figure omitted).

1.2.2 Extracting the FPN features

Now we extract the output features.
This has two steps: body and fpn.
body extracts the raw per-layer feature maps from the backbone (PyTorch implements this as the IntermediateLayerGetter class).
fpn processes and packages them into the FPN structure (PyTorch implements this as the FeaturePyramidNetwork class).

def forward(self, x: Tensor) -> Dict[str, Tensor]:
     x = self.body(x)
     x = self.fpn(x)
     return x
1.2.2.1 body

The ResNet layer structure is shown below; we need the outputs of layer1-4 to build the FPN.
Keep every layer up to and including the last entry of return_layers (layer4), i.e. drop the unused avgpool and fc layers.
(all layers of resnet50: figure omitted)
Simplified reference code:

import copy
from collections import OrderedDict

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
from torchvision import transforms

if __name__ == '__main__':
    model = resnet50(weights=ResNet50_Weights.DEFAULT)
    original_img = Image.open("a.png").convert('RGB')
    ts = transforms.Compose([transforms.ToTensor()])
    img = ts(original_img)
    img = torch.unsqueeze(img, dim=0)
    model.eval()

    # Names of the layers whose outputs are needed to build the FPN
    return_layers = {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'}
    # Make a copy of return_layers
    return_layers_copy = copy.deepcopy(return_layers)
    # Store the layers we keep
    layers = OrderedDict()
    # Keep every layer up to the last one in return_layers (layer4), dropping the unused avgpool and fc layers
    for name, module in model.named_children():
        layers[name] = module
        if name in return_layers_copy:
            del return_layers_copy[name]
        # Once all requested layers have been seen we have reached the last one; stop iterating
        if not return_layers_copy:
            break
    # Collect the results
    out = OrderedDict()
    for name, module in layers.items():
        img = module(img)
        if name in return_layers:
            out_name = return_layers[name]
            out[out_name] = img
    # rs = model(img)
    print(out)

The output is shown below (figure omitted): the useful layers have been extracted and their outputs collected.

1.2.2.2 fpn

Apply a 1x1 convolution to each of the 4 entries of the body output out (4 separate convolutions, all with 256 output channels and input channels from 256 to 2048). Making the depths identical lets the later rpn_head share its convolutions across levels.
Simplified code:

o2 = list()
in_channels_list = [256, 512, 1024, 2048]
out_channels = 256
# A 1x1 convolution per level maps every feature map to the same depth
inner_blocks = nn.ModuleList()
for index, in_channels in enumerate(in_channels_list):
    inner_block_module = nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0)
    inner_blocks.append(inner_block_module)
    # Equivalent to a Conv2dNormActivation without norm or activation layers; note that Conv2dNormActivation
    # computes the padding automatically so the spatial size is preserved
    # inner_block_module2 = Conv2dNormActivation(in_channels, out_channels, kernel_size=1, padding=0, norm_layer=None,
    #                                            activation_layer=None)
    x = out.get(str(index))
    x = inner_block_module(x)
    o2.append(x)

print(o2)

The output is as follows (figure omitted). Next, upsample and add the same-depth results (o2), i.e.:
Note: for readability, C stands for o2 in the code and C2 for index 2 of o2, so indexing starts at 0; this differs slightly from the paper.
C3=P3
P3上采样+C2=P2
P2上采样+C1=P1
P1上采样+C0=P0

Apply a 3x3 convolution to each of P0-3, and a max pooling to P3 to get pool:
P0->3x3卷积=out(0)
P1->3x3卷积=out(1)
P2->3x3卷积=out(2)
P3->3x3卷积=out(3)
P3->maxpool=pool
Simplified illustrative code:

    # Upsample, merge, and convolve the merged results again
    o3 = list()
    # The last (smallest) map goes straight into the results as the top level
    last_inner = o2[-1]
    o3.append(last_inner)

    # Iterate over o2 in reverse, starting from the third entry from the end
    # A 3x3 convolution smooths each merged result to reduce upsampling aliasing
    layer_blocks = nn.ModuleList()
    for idx in range(len(o2) - 2, -1, -1):
        # Current lateral feature and its spatial shape
        inner_lateral = o2[idx]
        feat_shape = inner_lateral.shape[-2:]
        # Upsample the feature from the level above
        inner_top_down = nn.functional.interpolate(last_inner, size=feat_shape)
        # Sum to form the next P level
        last_inner = inner_lateral + inner_top_down

        # A 3x3 convolution (padding=1 keeps the spatial size) smooths the merged result to reduce aliasing
        layer_block_module = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        layer_blocks.append(layer_block_module)

        # Insert at the front so the final order goes from largest to smallest map
        o3.insert(0, layer_block_module(last_inner))

    # Max-pool the smallest feature map once more (kernel 1, stride 2) to get the extra level
    tm = nn.functional.max_pool2d(o3[-1], kernel_size=1, stride=2, padding=0)
    o3.append(tm)

    names = ["0", "1", "2", "3", "pool"]
    # make it back an OrderedDict
    out = OrderedDict([(k, v) for k, v in zip(names, o3)])
    print(o3)

The complete demo code for everything above:

import copy
from collections import OrderedDict

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
from torchvision import transforms
from torchvision.ops import Conv2dNormActivation
from torch import nn

if __name__ == '__main__':
    model = resnet50(weights=ResNet50_Weights.DEFAULT)
    original_img = Image.open("a.png").convert('RGB')
    ts = transforms.Compose([transforms.ToTensor()])
    img = ts(original_img)
    img = torch.unsqueeze(img, dim=0)
    model.eval()

    # Names of the layers whose outputs are needed to build the FPN
    return_layers = {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'}
    # Make a copy of return_layers
    return_layers_copy = copy.deepcopy(return_layers)
    # Store the layers we keep
    layers = OrderedDict()
    # Keep every layer up to the last one in return_layers (layer4), dropping the unused avgpool and fc layers
    for name, module in model.named_children():
        layers[name] = module
        if name in return_layers_copy:
            del return_layers_copy[name]
        # Once all requested layers have been seen we have reached the last one; stop iterating
        if not return_layers_copy:
            break
    # Collect the results
    out = OrderedDict()
    for name, module in layers.items():
        img = module(img)
        if name in return_layers:
            out_name = return_layers[name]
            out[out_name] = img
    print(out)

    in_channels_list = [256, 512, 1024, 2048]
    out_channels = 256

    # Create the 1x1 convolutions and apply them
    o2 = list()
    # A 1x1 convolution per level maps every feature map to the same depth
    inner_blocks = nn.ModuleList()
    for index, in_channels in enumerate(in_channels_list):
        # The conv that maps this level to the common depth
        inner_block_module = nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0)
        inner_blocks.append(inner_block_module)
        # Equivalent to a Conv2dNormActivation without norm or activation layers
        # inner_block_module2 = Conv2dNormActivation(in_channels, out_channels, kernel_size=1, padding=0,
        #                                            norm_layer=None, activation_layer=None)
        x = out.get(str(index))
        x = inner_block_module(x)
        o2.append(x)

    # Upsample, merge, and convolve the merged results again
    o3 = list()
    # The last (smallest) map goes straight into the results as the top level
    last_inner = o2[-1]
    o3.append(last_inner)

    # Iterate over o2 in reverse, starting from the third entry from the end
    # A 3x3 convolution smooths each merged result to reduce upsampling aliasing
    layer_blocks = nn.ModuleList()
    for idx in range(len(o2) - 2, -1, -1):
        # Current lateral feature and its spatial shape
        inner_lateral = o2[idx]
        feat_shape = inner_lateral.shape[-2:]
        # Upsample the feature from the level above
        inner_top_down = nn.functional.interpolate(last_inner, size=feat_shape)
        # Sum to form the next P level
        last_inner = inner_lateral + inner_top_down

        # A 3x3 convolution (padding=1 keeps the spatial size) smooths the merged result to reduce aliasing
        layer_block_module = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        layer_blocks.append(layer_block_module)

        # Insert at the front so the final order goes from largest to smallest map
        o3.insert(0, layer_block_module(last_inner))

    # Max-pool the smallest feature map once more (kernel 1, stride 2) to get the extra level
    tm = nn.functional.max_pool2d(o3[-1], kernel_size=1, stride=2, padding=0)
    o3.append(tm)

    names = ["0", "1", "2", "3", "pool"]
    # make it back an OrderedDict
    out = OrderedDict([(k, v) for k, v in zip(names, o3)])
    print(o3)

In this process the learnable parameters live in:
1. the backbone network (some layers can be frozen; by default PyTorch trains 3 of them)
2. the 1x1 conv layers that bring the feature maps to the same depth
3. the 3x3 conv layers applied before the final outputs

2. Building the RPN network

Above we obtained 5 feature maps of different sizes, all with depth 256. We now build the RPN (Region Proposal Network) on top of them.
Note: below, "original image" always means the image after the transform's resize, not the true original image.
Note: that image size went through one resize, while the transform performs 2 operations in total: a resize and a merge into a single tensor (after which all sizes are equal); after the resize alone the sizes still differ.

        features = self.backbone(images.tensors)
        if isinstance(features, torch.Tensor):
            features = OrderedDict([("0", features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)  # type: ignore[operator]

2.1 Creating anchors

The FPN produced 5 feature maps (figure omitted); we take the largest one as the example (the others are handled the same way):

  1. The feature map has shape 1x256x200x272 (batch_size=1) -> call it A
  2. The original image size is 800x1066 (the true original was 1848x2464) -> call it B
  3. B is roughly 4 times the size of A, i.e. a stride of about 4.
  4. Mapping a point of A onto B therefore corresponds to a 4-pixel step on B.
  5. For every point of A we create 9 anchors, with aspect ratios (0.5, 1, 2) and scales (16, 32, 64).
  6. In total that is 200x272x9 = 489600 anchors.
  7. Next, compute the actual coordinates of every anchor and return a (489600, 4) tensor in (x1, y1, x2, y2) format.

2.1.1 Building the base cell_anchors

  1. Example code that computes the base sizes of the 9 anchors. scales is the square root of the area; equivalently, when the aspect ratio is 1 the side length equals the scale. (Note that the real code should return a list with one entry per feature map, to cover all 5 levels.)
    Call stack:
    GeneralizedRCNN.forward ->
    RegionProposalNetwork.forward ->
    AnchorGenerator.forward
import torch
from torch import Tensor


def generate_anchors(
        scales,
        aspect_ratios,
        dtype: torch.dtype = torch.float32,
        device: torch.device = torch.device("cpu"),
) -> Tensor:
    scales = torch.as_tensor(scales, dtype=dtype, device=device)
    aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
    h_ratios = torch.sqrt(aspect_ratios)
    w_ratios = 1 / h_ratios

    ws = (w_ratios[:, None] * scales[None, :]).view(-1)
    hs = (h_ratios[:, None] * scales[None, :]).view(-1)

    base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
    return base_anchors.round()


if __name__ == '__main__':
    size = (16, 32, 64)
    aspect_ratio = (0.5, 1.0, 2.0)
    print(generate_anchors(size, aspect_ratio))


Output:

tensor([[-11.,  -6.,  11.,   6.],
        [-23., -11.,  23.,  11.],
        [-45., -23.,  45.,  23.],
        [ -8.,  -8.,   8.,   8.],
        [-16., -16.,  16.,  16.],
        [-32., -32.,  32.,  32.],
        [ -6., -11.,   6.,  11.],
        [-11., -23.,  11.,  23.],
        [-23., -45.,  23.,  45.]])

2.1.2 Adding the cell_anchors to the grid centers

  1. Using the base cell_anchors, the coordinates of every point of A, and the scale relative to the original image, compute all anchors.
    cell_anchor is (9x4): every point of A gets 9 candidate boxes.
    A is (200x272), i.e. 54400 points in total.
    So the final result is a (54400x9x4) = (489600x4) tensor,
    where each row of 4 values is a cell_anchor offset plus the scaled coordinates of a point of A.
    Simplified code:
import torch

from test2 import generate_anchors

if __name__ == '__main__':
    grid_height = 200
    grid_width = 272
    # Scale (stride) relative to the original image
    stride_height = torch.tensor(4)
    stride_width = torch.tensor(4)
    device = torch.device("cpu")
    # Scale the feature-map grid points back onto the original image using the stride
    shifts_x = torch.arange(0, grid_width, dtype=torch.int32, device=device) * stride_width
    shifts_y = torch.arange(0, grid_height, dtype=torch.int32, device=device) * stride_height
    # Turn the (200x272)=54400 centers (cx,cy) into 54400x4 data, i.e. 54400 x (x1,y1,x2,y2) = (cx,cy,cx,cy)
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x, indexing="ij")
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)
    # shifts holds the 54400 centers, each written as (x1,y1,x2,y2)
    shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)
    # Get the cell_anchors (9,4): each center has 9 candidates whose (x1,y1,x2,y2) offsets are (a1,b1,a2,b2), e.g. [-11., -6., 11., 6.]
    size = (16, 32, 64)
    aspect_ratio = (0.5, 1.0, 2.0)
    cell_anchor = generate_anchors(size, aspect_ratio)
    # Broadcasting shifts (54400x4) + cell_anchors (9x4) gives the result (489600x4): the first dims expand by 9 and 54400 respectively
    anchor = shifts.view(-1, 1, 4) + cell_anchor.view(1, -1, 4)
    # Reshape 54400x9x4 into 489600x4: 489600 anchors, each with one (x1,y1,x2,y2) box
    anchor = anchor.view(-1, 4)
    print(anchor.shape)
  1. Do the same for all 5 feature maps; pay attention to the choice of sizes. Different feature maps can use different sizes to suit the scale each map represents.
  2. The higher the resolution of a feature map, the smaller its stride and the closer its centers, so its sizes should be smaller.
  3. The lower the resolution, the larger its stride and the farther apart its centers, so its sizes should be larger.
  4. For example, sizes can be ((16, 32, 64), (32, 64, 128), (64, 128, 256), (128, 256, 512), (256, 512, 1024)), or kept identical, depending on your needs; likewise aspect_ratio can be chosen to match the data, e.g. (0.25, 0.5, 1.0, 2.0, 4.0) when long thin objects are present.
  5. With scales=3 and aspect_ratio=3 per level, the per-level anchor counts are listed below and simply summed (see the quick check after the formula):
$$
\begin{align}
200 \times 272 \times 9 &= 489600\\
100 \times 136 \times 9 &= 122400\\
50 \times 68 \times 9 &= 30600\\
25 \times 34 \times 9 &= 7650\\
13 \times 19 \times 9 &= 2223\\
\text{sum} &= 652473
\end{align}
$$
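A quick check of these totals (the grid sizes are the ones from the example above):

grid_sizes = [(200, 272), (100, 136), (50, 68), (25, 34), (13, 19)]
counts = [h * w * 9 for h, w in grid_sizes]
print(counts)       # [489600, 122400, 30600, 7650, 2223]
print(sum(counts))  # 652473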

2.1.3 Computing all anchors and replicating them per image

The code above handles a single scale as a simple test. Below is the multi-scale version; it proceeds as follows.

It uses
strides - the scale factors
cell_anchors - the base boxes
features - the FPN feature maps
to build the anchors at every scale, concatenate them, and finally replicate them once per image.

The core anchor-creation code:

class AnchorGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # The length of sizes must match the number of feature levels
        self.sizes = ((8, 16, 32), (16, 32, 64), (32, 64, 128), (64, 128, 256), (256, 512, 1024))
        self.aspect_ratios = ((0.5, 1.0, 2.0),) * len(self.sizes)
        self.device = torch.device("cpu")

    def forward(self, image_sizes, images, features):
        """
        image_sizes resize后的原始图像大小
        images 合并后统一的图像Tensor(4,3,800,1344)
        features fpn的特征图
        """
        # 上获取合并后统一的图像Tensor(4,3,800,1344)大小,即 800x1344
        # 分别除以特征图每一个层级的大小,得到每一层级的缩放比例strides
        strides = []
        grid_sizes = [feather.shape[-2:] for feather in features]
        image_size = images.shape[-2:]
        for g in grid_sizes:
            stride_x = torch.empty((), dtype=torch.int64, device=self.device).fill_(image_size[0] // g[0])
            stride_y = torch.empty((), dtype=torch.int64, device=self.device).fill_(image_size[1] // g[1])
            strides.append([stride_x, stride_y])

        # Create the cell_anchors for every level
        # cell_anchors is (9,4): each center has 9 candidates whose (x1,y1,x2,y2) offsets are (a1,b1,a2,b2), e.g. [-11., -6., 11., 6.]
        cell_anchors = [self.generate_anchors(size, aspect_ratio) for size, aspect_ratio in
                        zip(self.sizes, self.aspect_ratios)]
        """
        Holds the computed anchor coordinates of each level, shaped like
        [torch.Size([595188, 4]),
        torch.Size([146412, 4]),
        torch.Size([35424, 4]),
        torch.Size([8280, 4]),
        torch.Size([2160, 4])]
        """
        anchors_over_all_feature_maps: list[torch.Tensor] = []
        # Compute the anchors for each scale, then concatenate
        for stride, cell_anchor, feather in zip(strides, cell_anchors, features):
            grid_height, grid_width = feather.shape[-2:]
            # Stride relative to the original image
            stride_height, stride_width = stride
            # Scale the feature-map grid points back onto the original image using the stride
            shifts_x = torch.arange(0, grid_width, dtype=torch.int32, device=self.device) * stride_width
            shifts_y = torch.arange(0, grid_height, dtype=torch.int32, device=self.device) * stride_height
            # Turn the (200x272)=54400 centers (cx,cy) into 54400x4 data, i.e. 54400 x (x1,y1,x2,y2) = (cx,cy,cx,cy)
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x, indexing="ij")
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)
            # shifts holds the 54400 centers, each written as (x1,y1,x2,y2)
            shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)

            # Broadcasting shifts (54400x4) + cell_anchors (9x4) gives the result (489600x4): the first dims expand by 9 and 54400 respectively
            anchor = shifts.view(-1, 1, 4) + cell_anchor.view(1, -1, 4)
            # Reshape 54400x9x4 into 489600x4: 489600 anchors, each with one (x1,y1,x2,y2) box
            anchor = anchor.view(-1, 4)
            anchors_over_all_feature_maps.append(anchor)

        # All images merged into the batch tensor have the same size,
        # so anchors_over_all_feature_maps holds the anchors for one image; they are the same for every image
        # in the batch and are therefore simply replicated once per image
        anchors = []
        for _ in range(len(image_sizes)):
            anchors_in_image = [anchors_per_feature_map for anchors_per_feature_map in anchors_over_all_feature_maps]
            anchors.append(anchors_in_image)
        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
        return anchors

    @staticmethod
    def generate_anchors(
            scales,
            aspect_ratios,
            dtype: torch.dtype = torch.float32,
            device: torch.device = torch.device("cpu"),
    ) -> Tensor:
        """
        Create the base boxes for a single level, shaped as below. Note that one set is needed per level.
        tensor([[-11.,  -6.,  11.,   6.],
        [-23., -11.,  23.,  11.],
        [-45., -23.,  45.,  23.],
        [ -8.,  -8.,   8.,   8.],
        [-16., -16.,  16.,  16.],
        [-32., -32.,  32.,  32.],
        [ -6., -11.,   6.,  11.],
        [-11., -23.,  11.,  23.],
        [-23., -45.,  23.,  45.]])
        """
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1 / h_ratios

        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)

        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        return base_anchors.round()


2.2 Computing a classification and regression value for every anchor

2.2.0 Purpose

With this many candidate boxes, a screening step is needed to filter out the mostly useless anchors (figure omitted).

  1. For each anchor we must decide whether to keep it, i.e. how likely it is to contain an object; for example, keep it if that probability exceeds 70%.
  2. "Contains an object" is a binary classification problem: the 2k scores in that figure.
  3. How far the predicted box is from the ground-truth box (GT) is a regression problem: the 4k coordinates in that figure.
  4. Note that this is done for every single anchor.

2.2.1 Creating the RPNHead to compute per-anchor classification and regression

Using all feature maps as input, create an RPNHead model that computes the predictions and returns two results: classification and regression.
That is: a 3x3 convolution, followed by a classification conv and a regression conv whose output channels equal the number of anchors per location (K). This yields two results:

objectness (652473x1) = (
200x272x9x1 +
100x136x9x1 +
...
)
pred_bbox_deltas (652473x4) = (
200x272x9x4 +
100x136x9x4 +
...
)
Note that pred_bbox_deltas are not box coordinates but parameterized offsets (relative to the anchors, supervised against the GT), hence the _deltas suffix.
The model code is below.
Call stack:
GeneralizedRCNN.forward ->
RegionProposalNetwork.forward ->
RPNHead.forward

class RPNHead(nn.Module):
    def __init__(self, in_channels: int, num_anchors: int, conv_depth=1) -> None:
        super().__init__()
        convs = []
        for _ in range(conv_depth):
            convs.append(Conv2dNormActivation(in_channels, in_channels, kernel_size=3, norm_layer=None))
        self.conv = nn.Sequential(*convs)
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)

        for layer in self.modules():
            if isinstance(layer, nn.Conv2d):
                torch.nn.init.normal_(layer.weight, std=0.01)  # type: ignore[arg-type]
                if layer.bias is not None:
                    torch.nn.init.constant_(layer.bias, 0)  # type: ignore[arg-type]

    # several methods omitted
    def forward(self, x: List[Tensor]) -> Tuple[List[Tensor], List[Tensor]]:
        logits = []
        bbox_reg = []
        for feature in x:
            t = self.conv(feature)
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg


2.2.2 Decoding pred_bbox_deltas

2.2.2.1 Parameterization: converting predicted offsets into predicted boxes (decode), everything in center-size form

$$
\begin{align}
\text{anchor boxes (anchors)} &: (x_a, y_a, w_a, h_a)\\
\text{predicted offsets (pred\_bbox\_deltas)} &: (t_x, t_y, t_w, t_h)\\
\text{predicted box (decoded result)} &: (x, y, w, h)\\
x &= t_x w_a + x_a\\
y &= t_y h_a + y_a\\
w &= w_a \cdot e^{t_w}\\
h &= h_a \cdot e^{t_h}
\end{align}
$$

Note that the exponential can blow up, so t_w and t_h are clamped to ln(62.5) (= math.log(1000/16)), i.e. the scale factor is limited to 62.5.

2.2.2.2 Code logic: preparing the inputs

1. The base anchors, shaped like:
[torch.Size([787464, 4]),
torch.Size([787464, 4]),
torch.Size([787464, 4]),
torch.Size([787464, 4])]
4 images in total, each with 787464 anchors, each with 4 coordinates.

2. The pred_bbox_deltas from the RPNHead, shaped like:
[torch.Size([4, 36, 198, 334]),
torch.Size([4, 36, 98, 166]),
torch.Size([4, 36, 48, 82]),
torch.Size([4, 36, 23, 40]),
torch.Size([4, 36, 12, 20])]
Each list entry is one FPN level of shape [batch_size, 4*9, h, w]: 4 images, 9 anchors per location, 4 coordinates each (hence 36), followed by h and w.

Per image the number of anchors is: 787464 = (198x334 + 98x166 + 48x82 + 23x40 + 12x20) x 9
Over all images: 787464 x 4 = 3149856

Reorganize the data so that anchors and pred_bbox_deltas share the same Tensor(3149856, 4) layout:
anchors -> boxes; pred_bbox_deltas -> rel_codes

See: torchvision.models.detection.rpn.concat_box_prediction_layers

2.2.2.3 The actual computation
  1. Convert the boxes from (x1, y1, x2, y2) to (w, h, cx, cy) (two-corner form to center-size form).
  2. Take the offsets dx, dy, dw, dh from rel_codes, i.e. positions 0,1,2,3 (there may be weights; if so, divide by them).
  3. Clamp dw, dh to math.log(1000.0 / 16).
  4. Compute the shifted coordinates:
$$
\begin{align}
pcx &= dx \cdot w + cx\\
pcy &= dy \cdot h + cy\\
pw &= \exp(dw) \cdot w\\
ph &= \exp(dh) \cdot h
\end{align}
$$
  5. Convert p back to two-corner form.
  6. Concatenate into one tensor.

Illustrative code:

    @staticmethod
    def decode(anchors, rel_codes):
        """
        Decode the RPNHead's pred_bbox_deltas into actual predicted boxes
        """
        concat_boxes = torch.cat(anchors, dim=0)

        max_clip = math.log(62.5)
        # two-corner form to center-size form
        w = concat_boxes[:, 2] - concat_boxes[:, 0]
        h = concat_boxes[:, 3] - concat_boxes[:, 1]
        cx = concat_boxes[:, 0] + 0.5 * w
        cy = concat_boxes[:, 1] + 0.5 * h

        # Get the offsets dx, dy, dw, dh at positions 0,1,2,3 (divide by the weights if any are used)
        dx = rel_codes[:, 0]
        dy = rel_codes[:, 1]
        dw = rel_codes[:, 2]
        dh = rel_codes[:, 3]

        # Clamp the offsets to the allowed range
        dw = torch.clamp(dw, max=max_clip)
        dh = torch.clamp(dh, max=max_clip)

        # Predicted boxes in center-size form; p stands for predict
        pcx = dx * w + cx
        pcy = dy * h + cy
        pw = torch.exp(dw) * w
        ph = torch.exp(dh) * h

        # center-size form back to two-corner form
        pw2 = pw/2
        ph2 = ph/2
        px1 = pcx - pw2
        py1 = pcy - ph2
        px2 = pcx + pw2
        py2 = pcy + ph2
        # Stack into one tensor
        pred_boxes = torch.stack((px1, py1, px2, py2), dim=1)
        return pred_boxes

2.2.3 Preparing the data for filtering

Organize the data so that the useless proposals can be filtered out.

2.2.3.1 proposals
# Reshape Tensor(3580920,4) -> Tensor(num_images,-1,4), i.e. split per image; this gives the proposals
proposals = pred_boxes.reshape(num_images, -1, 4)

2.3 Filtering the proposals

2.3.1 Filtering by objectness score and organizing the data

2.3.1.1 Input parameters

proposals: Tensor,
torch.Size([4, 268569, 4])
objectness: Tensor,
torch.Size([1074276, 1])
image_shapes: List[Tuple[int, int]],
e.g. [(788, 1333), (800, 1225), (658, 1333), (800, 920)]; used to clip out-of-range proposals
num_anchors_per_level: List[int],
e.g. [201600, 50400, 12600, 3150, 819]; used for the per-level top-k selection and index offsets

2.3.1.2 Basic filtering and data preparation

Take the top K entries from each level and return their indices.
If the per-level anchor counts are [201600, 50400, 12600, 3150, 819] and k = 2000, then 2000 + 2000 + 2000 + 2000 + 819 = 8819 entries are taken in total.
Note: because of the batch dimension, the selected indices have shape Tensor(batch_size, 8819).
Note: objectness is passed through a sigmoid to get probabilities in [0, 1], recorded as the scores.

Note: batched NMS needs a category index for every box, so levels must also be built.

from torchvision.ops import boxes as box_ops
 # First put objectness into a uniform shape
 num_images = len(image_sizes)
 objectness = objectness.reshape(num_images, -1)

 # batched NMS needs a per-box category index, so build levels
 levels = [
     torch.full((n,), idx, dtype=torch.int64) for idx, n in enumerate(num_anchors_per_level)
 ]
 levels = torch.cat(levels, 0)
 levels = levels.reshape(1, -1).expand_as(objectness)

 # Split by level and take the top self.top_idx_n entries per level by objectness
 top_n_idxs = []
 offset = 0
 for ob in objectness.split(num_anchors_per_level, 1):
     num_anchors = ob.shape[1]
     _, top_n_idx = ob.topk(min(self.top_idx_n, num_anchors), dim=1)
     top_n_idxs.append(top_n_idx + offset)
     offset += num_anchors
 top_n_idxs = torch.cat(top_n_idxs, dim=1)

 # Gather the selected indices from the data
 image_range = torch.arange(num_images)
 batch_idx = image_range[:, None]
 objectness = objectness[batch_idx, top_n_idxs]
 proposals = proposals[batch_idx, top_n_idxs]
 levels = levels[batch_idx, top_n_idxs]
 # Binary classification, so a sigmoid gives the probability directly
 objectness_prob = torch.sigmoid(objectness)

2.3.2 Filtering each image

Library functions exist for the basic operations, in:

from torchvision.ops import boxes as box_ops
2.3.2.1 Clipping proposals to the image
# Clip the proposals to the image boundaries
boxes = box_ops.clip_boxes_to_image(boxes, image_size)

torchvision.ops.boxes.clip_boxes_to_image can be called directly; its core logic is the following code.

    @staticmethod
    def clip_boxes_to_image(boxes: Tensor, size: tuple[int, int]) -> Tensor:
        boxes_x = boxes[:, 0::2]
        boxes_y = boxes[:, 1::2]
        height, width = size

        boxes_x = boxes_x.clamp(min=0, max=width)
        boxes_y = boxes_y.clamp(min=0, max=height)

        clipped_boxes = torch.stack((boxes_x, boxes_y), dim=2)
        return clipped_boxes.reshape(boxes.shape)
2.3.2.2 Removing small boxes

Note: this removes boxes whose width or height is smaller than the given threshold.

# Remove small boxes
keep = box_ops.remove_small_boxes(boxes, self.min_size)
boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
2.3.2.3 Removing low-score boxes
# Remove low-score boxes
keep = torch.where(scores >= self.score_thresh)[0]
boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
2.3.2.4 NMS (non-maximum suppression, per level)
# Per-level batched NMS; returns indices sorted by descending score
keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
2.3.2.5 Keeping the top n entries by score
keep = keep[: self.post_nms_top_n]
boxes, scores = boxes[keep], scores[keep]

2.3.3 Complete filtering code

def filter_proposals(self, proposals, objectness, image_sizes, num_anchors_per_level):
     """
     Args:
          proposals: Tensor, e.g. torch.Size([4, 268569, 4])
          objectness: Tensor, e.g. torch.Size([1074276, 1])
          image_sizes: List[Tuple[int, int]], e.g. [(788, 1333), (800, 1225), (658, 1333), (800, 920)]; used to clip out-of-range proposals
          num_anchors_per_level: List[int], e.g. [201600, 50400, 12600, 3150, 819]; used for the per-level top-k selection and index offsets
     """
      # First put objectness into a uniform shape
     num_images = len(image_sizes)
     objectness = objectness.reshape(num_images, -1)

      # batched NMS needs a per-box category index, so build levels
     levels = [
         torch.full((n,), idx, dtype=torch.int64) for idx, n in enumerate(num_anchors_per_level)
     ]
     levels = torch.cat(levels, 0)
     levels = levels.reshape(1, -1).expand_as(objectness)

      # Split by level and take the top self.top_idx_n entries per level by objectness
     top_n_idxs = []
     offset = 0
     for ob in objectness.split(num_anchors_per_level, 1):
         num_anchors = ob.shape[1]
         _, top_n_idx = ob.topk(min(self.top_idx_n, num_anchors), dim=1)
         top_n_idxs.append(top_n_idx + offset)
         offset += num_anchors
     top_n_idxs = torch.cat(top_n_idxs, dim=1)

      # Gather the selected indices from the data
     image_range = torch.arange(num_images)
     batch_idx = image_range[:, None]
     objectness = objectness[batch_idx, top_n_idxs]
     proposals = proposals[batch_idx, top_n_idxs]
     levels = levels[batch_idx, top_n_idxs]
      # Binary classification, so a sigmoid gives the probability directly
     objectness_prob = torch.sigmoid(objectness)

      # Store the results; each entry holds the filtered proposals of one image
     final_boxes = []
     final_scores = []
     for boxes, scores, lvl, image_size in zip(proposals, objectness_prob, levels, image_sizes):
          # Clip the proposals to the image boundaries
          boxes = box_ops.clip_boxes_to_image(boxes, image_size)
          # Remove small boxes
         keep = box_ops.remove_small_boxes(boxes, self.min_size)
         boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
          # Remove low-score boxes
         keep = torch.where(scores >= self.score_thresh)[0]
         boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
          # Per-level batched NMS; returns indices sorted by descending score
         keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)

         keep = keep[: self.post_nms_top_n]
         boxes, scores = boxes[keep], scores[keep]
         final_boxes.append(boxes)
         final_scores.append(scores)
     return final_boxes, final_scores

2.2.4 Computing the losses

Of the 5 losses returned by the model, these 2 are computed here (during training):

| key | type | dtype | size | loss function | remark |
| --- | --- | --- | --- | --- | --- |
| loss_objectness | Tensor | float32 | () | binary cross entropy (with logits) | RPN classification loss: foreground/background binary classification |
| loss_rpn_box_reg | Tensor | float32 | () | Smooth L1 | RPN bounding box regression loss |
2.2.4.0 Matching, encoding and sampling
  1. The loss is computed from the predicted objectness, pred_bbox_deltas and the labels.
  2. This computation is independent of the earlier proposal generation and filtering.
  3. The predictions are parameterized offsets (pred_bbox_deltas), while the labels hold actual box coordinates, so the gt boxes must be encoded against the anchors to produce the regression targets (regression_targets).
  4. The regression loss is then the smooth_l1_loss between pred_bbox_deltas and regression_targets.
  5. The number of anchors ranges from hundreds of thousands to over a million, and almost all of them are negatives; there may be only a few hundred positives, which makes the loss very unbalanced.

If the raw data were used for training directly, the model would drift towards the majority class (negatives), because optimization minimizes the total loss and the majority class contributes most of it.

  1. Moreover, the sheer amount of (mostly uninformative) data increases the computation, and sampling also acts as a mild regularizer.
  2. So a balanced sampling step is applied, aiming for a 1:1 positive/negative ratio (configurable); the PyTorch default is 256 samples per image with a positive fraction of 0.5, i.e. 1:1.
  3. See the source (a minimal sketch follows below):
torchvision.models.detection._utils.BalancedPositiveNegativeSampler
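A minimal sketch of what such a balanced sampler does for one image (this mirrors the idea behind BalancedPositiveNegativeSampler, not its exact code; the function name is illustrative):

import torch

def sample_balanced(labels: torch.Tensor, batch_size_per_image: int = 256, positive_fraction: float = 0.5):
    # labels: Tensor(n,) with 1 = positive, 0 = negative, -1 = discard
    positive = torch.where(labels == 1)[0]
    negative = torch.where(labels == 0)[0]
    num_pos = min(int(batch_size_per_image * positive_fraction), positive.numel())
    num_neg = min(batch_size_per_image - num_pos, negative.numel())
    # shuffle, then keep the first num_pos / num_neg indices
    pos_idx = positive[torch.randperm(positive.numel())[:num_pos]]
    neg_idx = negative[torch.randperm(negative.numel())[:num_neg]]
    return pos_idx, neg_idx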
2.2.4.0.1 Assigning a target to each anchor

To compute the classification and regression losses for an anchor, we must know which gt box it corresponds to, so every anchor has to be assigned a gt box from the targets.
Function:

labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
"""
n is the number of anchors
labels: Tensor(n,)  classification labels: -1 = discard, 0 = negative sample, 1 = positive sample
matched_gt_boxes: Tensor(n,4)  which gt_box each anchor matches, with its coordinates; only the entries of positive samples are meaningful
"""

Example: an image produced 268596 anchors (Tensor(268596,4)) and has 3 annotated objects, gt_box (Tensor(3,4)).

  1. Each of the 268596 anchors must be assigned to one of the gt_boxes.
  2. If all 3 objects are small, the vast majority of anchors will not overlap any gt_box and cannot be assigned.
  3. Overlap is handled first, most commonly with the IoU metric.
  4. Computing IoU between Tensor(268596,4) and Tensor(3,4) gives Tensor(268596,3): the overlap of every anchor with every gt_box.
  5. From it we compute matched_idxs (Tensor(268596,)), storing for each anchor the index of its best-matching gt_box.
  6. Take the best of the 3 gt boxes; if that value is 0, the anchor does not overlap any gt_box at all.
  7. Set two thresholds, high_threshold = 0.7 and low_threshold = 0.3: above 0.7 is a positive, below 0.3 a negative, and anything in between is ambiguous and simply discarded.
  8. Above 0.7 the entry is set to the matching gt index; between 0.3 and 0.7 it is marked -2; below 0.3 it is marked -1.
  9. So the unique values of matched_idxs are [-2, -1, 0, 1, 2].
  10. -2 and -1 dominate (assuming the 3 objects are small) and mean no good GT match; they can simply point at gt_box 0 (which is what PyTorch does) or be filled with zeros, it does not matter, because negatives are not used in the regression loss.
  11. 0, 1, 2 are the indices of the matched gt boxes and are used directly.
  12. This produces the (unencoded) regression labels (matched_gt_boxes).
  13. Map -1 to negative -> label 0, -2 to discard -> label -1, everything else to positive -> label 1; this produces the classification labels (labels).

Illustrative code:

 def assign_targets_to_anchors(self, anchors, targets):
        labels = []
        matched_gt_boxes = []
        for anchors_per_image, targets_per_image in zip(anchors, targets):
            gt_boxes = targets_per_image["boxes"]
            # If gt_boxes is empty, all labels are negatives and the matched targets are meaningless; fill with zeros
            if gt_boxes.numel() == 0:
                matched_gt_boxes_per_image = torch.zeros(anchors_per_image.shape, dtype=torch.float32)
                labels_per_image = torch.zeros((anchors_per_image.shape[0],), dtype=torch.float32)
            else:
                # Compute IoU
                iou = box_ops.box_iou(gt_boxes, anchors_per_image)
                # For each anchor take the gt_box with the highest IoU: matched_vals are the values, matches the indices
                matched_vals, matches = iou.max(dim=0)
                # Split by thresholds: max > 0.7 positive; 0.3 <= iou < 0.7 discarded (marked -2); iou < 0.3 negative (marked -1)
                min_match = matched_vals < self.bg_iou_thresh
                mid_match = (matched_vals >= self.bg_iou_thresh) & (matched_vals < self.fg_iou_thresh)
                matches[min_match] = -1
                matches[mid_match] = -2
                # At this point the unique values of matches look like [-2,-1,0,1,2]; -2 and -1 dominate (assuming small objects)
                # and mean no sufficiently good GT match, while 0,1,2 are the indices of the matched gt boxes.
                # Based on matches we build matched_gt_boxes_per_image and labels_per_image.
                # For the negative values (no good match) we can simply point at gt_box 0 (as PyTorch does) or fill with zeros;
                # it does not matter, those entries are never used. The non-negative values map directly to their gt box.
                matched_gt_boxes_per_image = gt_boxes[matches.clamp(min=0)]
                # Map 0,1,2 to positive label 1, -1 to negative label 0, and -2 to discard label -1
                labels_per_image = matches >= 0
                labels_per_image = labels_per_image.to(torch.float32)
                labels_per_image[matches == -1] = 0
                labels_per_image[matches == -2] = -1
            labels.append(labels_per_image)
            matched_gt_boxes.append(matched_gt_boxes_per_image)
        return labels, matched_gt_boxes
2.2.4.0.2 Encoding the gt_boxes

This operation is analogous to decoding pred_bbox_deltas:

| operation | inputs | purpose |
| --- | --- | --- |
| encode | gt_boxes and anchors | the actual offsets between the gt boxes and the anchors |
| decode | pred_bbox_deltas and anchors | the predicted boxes from the predicted offsets and the anchors |

$$
\begin{align}
\text{anchor boxes (anchors)} &: (x_a, y_a, w_a, h_a)\\
\text{gt\_boxes} &: (x, y, w, h)\\
\text{offsets between gt box and anchor (encoded result)} &: (t_x, t_y, t_w, t_h)\\
t_x &= (x - x_a)/w_a\\
t_y &= (y - y_a)/h_a\\
t_w &= \ln(w/w_a)\\
t_h &= \ln(h/h_a)
\end{align}
$$

    @staticmethod
    def encode(anchors, gt_boxes):
        ax1 = anchors[:, 0].unsqueeze(1)
        ay1 = anchors[:, 1].unsqueeze(1)
        ax2 = anchors[:, 2].unsqueeze(1)
        ay2 = anchors[:, 3].unsqueeze(1)

        gx1 = gt_boxes[:, 0].unsqueeze(1)
        gy1 = gt_boxes[:, 1].unsqueeze(1)
        gx2 = gt_boxes[:, 2].unsqueeze(1)
        gy2 = gt_boxes[:, 3].unsqueeze(1)

        # Convert to center-size form
        aw = ax2 - ax1
        ah = ay2 - ay1
        acx = ax1 + 0.5 * aw
        acy = ay1 + 0.5 * ah

        gw = gx2 - gx1
        gh = gy2 - gy1
        gcx = gx1 + 0.5 * gw
        gcy = gy1 + 0.5 * gh

        # Compute the offsets
        tx = (gcx - acx) / aw
        ty = (gcy - acy) / ah
        tw = torch.log(gw/aw)
        th = torch.log(gh/ah)

        # Concatenate
        targets = torch.cat((tx, ty, tw, th), dim=1)
        return targets
2.2.4.1 The actual computation
2.2.4.1.0 Sampling

Balanced sampling is used: 256 samples per image with a 1:1 positive/negative ratio.
The code is in:

torchvision.models.detection._utils.BalancedPositiveNegativeSampler
2.2.4.1.1 RPN classification loss and RPN box regression loss
    def compute_loss(self, objectness, pred_bbox_deltas, labels, regression_targets):
        # Sample, take the sampled indices, and concatenate them
        sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
        sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
        sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]
        sampled_inds = torch.cat([sampled_pos_inds, sampled_neg_inds], dim=0)

        # Drop the per-image dimension before computing
        objectness = objectness.flatten()
        labels = torch.cat(labels, dim=0)
        regression_targets = torch.cat(regression_targets, dim=0)

        # Smooth L1 loss for the regression
        box_loss = nn.functional.smooth_l1_loss(
            pred_bbox_deltas[sampled_pos_inds],
            regression_targets[sampled_pos_inds],
            beta=1 / 9,
            reduction="sum",
        ) / (sampled_inds.numel())

        # Binary cross entropy (with logits) for the classification
        objectness_loss = nn.functional.binary_cross_entropy_with_logits(objectness[sampled_inds], labels[sampled_inds])

        return objectness_loss, box_loss

2.2.5 Returning the results

Return the proposals filtered in 2.3.2.5 (the ROIs) together with the losses.
Example code:

    def forward(self, image_sizes, images, features, targets):
        # Take the dict values as a list
        features = list(features.values())
        """
        Get all anchors. Returns a list of length batch_size of Tensor(n,4), where n is the number of anchors per image (summed over all levels).
        The output looks like
        [torch.Size([787464, 4]),
        torch.Size([787464, 4]),
        torch.Size([787464, 4]),
        torch.Size([787464, 4])]
        """
        anchors = self.anchor_generator(image_sizes, images, features)
        # Record the number of images, i.e. batch_size
        num_images = len(anchors)
        """
        Predict from the feature maps: the object probability (objectness) and the box information (pred_bbox_deltas).
        Returns lists whose length equals the number of FPN levels, each element Tensor(batch_size, k or 4k, h, w).
        Note: pred_bbox_deltas are not box coordinates but parameterized offsets relative to the anchors, hence the _deltas suffix.
        The rpn_head predicts, for every anchor at every location, whether it is an object and its offset from the matching box.
        objectness looks like
        [torch.Size([4, 9, 198, 334]),
        torch.Size([4, 9, 98, 166]),
        torch.Size([4, 9, 48, 82]),
        torch.Size([4, 9, 23, 40]),
        torch.Size([4, 9, 12, 20])]
        pred_bbox_deltas looks like
        [torch.Size([4, 36, 198, 334]),
        torch.Size([4, 36, 98, 166]),
        torch.Size([4, 36, 48, 82]),
        torch.Size([4, 36, 23, 40]),
        torch.Size([4, 36, 12, 20])]
        """
        objectness, pred_bbox_deltas = self.rpn_head(features)
        # From objectness, count the anchors per level per image; filter_proposals needs this
        num_anchors_per_level = []
        for obj in objectness:
            sp = obj.shape
            num_anchors_per_level.append(sp[1] * sp[2] * sp[3])
        # Flatten objectness and pred_bbox_deltas to (n,1) and (n,4); not a core step, so the library function is used here
        objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness, pred_bbox_deltas)
        # Decode pred_bbox_deltas; detach so decoding does not backpropagate into the predictions
        pred_boxes = self.decode(anchors, pred_bbox_deltas.detach())
        # Reshape Tensor(3580920,4) -> Tensor(num_images,-1,4), i.e. split per image; this gives the proposals
        proposals = pred_boxes.reshape(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness.detach(), image_sizes, num_anchors_per_level)

        losses = {}
        # Losses are only computed during training
        if self.training:
            if targets is None:
                raise ValueError("targets should not be None")
            # Match the anchors to the gt_boxes in targets and decide which are positives/negatives
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            # Encode the gt_boxes (after matching: matched_gt_boxes)
            regression_targets = self.encode(anchors, matched_gt_boxes)
            # Compute the actual losses
            loss_objectness, loss_rpn_box_reg = self.compute_loss(
                objectness, pred_bbox_deltas, labels, regression_targets
            )
            losses = {
                "loss_objectness": loss_objectness,
                "loss_rpn_box_reg": loss_rpn_box_reg,
            }
        return boxes, losses

3. Building the ROI Heads

3.1 Overview and example

3.1.1 Example and processing

First, recall the data computed so far. Ignore the loss computation and look only at the forward pass (and leave the mask branch aside for now).
The goal is to use the proposals (ROIs) to predict, for each ROI, its class probabilities and its box.
Example:

  1. Let batch_size be 4;
  2. each image has 2000 proposals, so proposals is a list of length 4 whose elements are Tensor(2000,4);
  3. there are 5 classes;
  4. the feature maps features are

0: torch.Size([4, 256, 200, 336])
1: torch.Size([4, 256, 100, 168])
2: torch.Size([4, 256, 50, 84])
3: torch.Size([4, 256, 25, 42])
pool: torch.Size([4, 256, 13, 21])

  1. Ignore the sampling step.
  2. The final output should be:

class_logits: Tensor(8000,5) -> Tensor(2000x4, 5): 5 values because there are 5 classes, giving each ROI's score for every class
box_regression: Tensor(8000,20) -> Tensor(2000x4, 5x4): 20 because a box is regressed separately for every class

That is, proposals and features go through a series of convolutions etc. to compute class_logits and box_regression.
Note!! The proposals only describe box coordinates and carry no image content; the image content lives in the feature maps (features), so proposals and features must be combined.
After combining them (box_roi_pool),
a few feature extraction layers follow (box_head), e.g. an MLP,
and then two independent fully connected layers (box_predictor) yield the classification and regression results.

For the example above, a simplified RoIHeads model looks like this (mask omitted), with two alternative box_head feature extractors:

# Using a two-layer MLP
  (box_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(7, 7), sampling_ratio=2)
  (box_head): TwoMLPHead(
    (fc6): Linear(in_features=12544, out_features=1024, bias=True)
    (fc7): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=5, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=20, bias=True)
  )
 # Using convolutions + FC
 (box_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(7, 7), sampling_ratio=2)
  (box_head): FastRCNNConvFCHead(
    (0): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (1): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (2): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (3): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (4): Flatten(start_dim=1, end_dim=-1)
    (5): Linear(in_features=12544, out_features=256, bias=True)
    (6): ReLU(inplace=True)
  )
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=5, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=20, bias=True)
  )

Appendix: the full model descriptions:

RoIHeads(
  (box_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(7, 7), sampling_ratio=2)
  (box_head): TwoMLPHead(
    (fc6): Linear(in_features=12544, out_features=1024, bias=True)
    (fc7): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=5, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=20, bias=True)
  )
  (mask_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(14, 14), sampling_ratio=2)
  (mask_head): MaskRCNNHeads(
    (0): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (1): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (2): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (3): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
  )
  (mask_predictor): MaskRCNNPredictor(
    (conv5_mask): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
    (relu): ReLU(inplace=True)
    (mask_fcn_logits): Conv2d(256, 5, kernel_size=(1, 1), stride=(1, 1))
  )
)
RoIHeads(
  (box_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(7, 7), sampling_ratio=2)
  (box_head): FastRCNNConvFCHead(
    (0): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (1): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (2): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (3): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (4): Flatten(start_dim=1, end_dim=-1)
    (5): Linear(in_features=12544, out_features=256, bias=True)
    (6): ReLU(inplace=True)
  )
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=5, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=20, bias=True)
  )
  (mask_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(14, 14), sampling_ratio=2)
  (mask_head): MaskRCNNHeads(
    (0): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (1): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (2): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (3): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
  )
  (mask_predictor): MaskRCNNPredictor(
    (conv5_mask): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
    (relu): ReLU(inplace=True)
    (mask_fcn_logits): Conv2d(256, 5, kernel_size=(1, 1), stride=(1, 1))
  )
)

3.1.2 Post-processing

The raw results cannot be used directly; a post-processing step (NMS) is needed to remove duplicate detections of the same region. A minimal sketch follows.
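A minimal sketch of this post-processing for one image, following the general logic of torchvision's postprocess_detections (scores assumed to be softmaxed, boxes already decoded and clipped; the thresholds are illustrative):

import torch
from torchvision.ops import boxes as box_ops

def postprocess_one_image(boxes, scores, score_thresh=0.05, nms_thresh=0.5, detections_per_img=100):
    # boxes: Tensor(N, num_classes, 4); scores: Tensor(N, num_classes)
    num_classes = scores.shape[1]
    # one class label per column, then drop column 0 (background)
    labels = torch.arange(num_classes).view(1, -1).expand_as(scores)
    boxes, scores, labels = boxes[:, 1:], scores[:, 1:], labels[:, 1:]
    # flatten the (box, class) pairs into independent detections
    boxes = boxes.reshape(-1, 4)
    scores = scores.reshape(-1)
    labels = labels.reshape(-1)
    # drop low scores and tiny boxes
    keep = torch.where(scores > score_thresh)[0]
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    # class-aware NMS, then keep only the highest-scoring detections
    keep = box_ops.batched_nms(boxes, scores, labels, nms_thresh)[:detections_per_img]
    return boxes[keep], scores[keep], labels[keep]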

3.2 Forward pass

Overall, we need to map the proposals (anchors) onto the features and obtain inputs of a consistent size.

  1. The proposals have different sizes; they can all be mapped onto the features, but a direct mapping yields inconsistent sizes.
  2. The convolutions or fully connected layers that follow require inputs of the same size.
  3. features is the FPN feature pyramid, so there are several feature maps at different scales.
  4. Depending on a proposal's size, map it onto the appropriate feature level.
  5. For a given proposal and its matching feature map, we need a fixed-size output; the simplest approach is a max pooling to a fixed size over the mapped region, i.e. ROI pooling.
  6. However, the floating-point rounding introduces misalignment, so the ROIAlign approach is more accurate (see the sketch after this list).
  7. The principle is explained well here: 图解 RoIAlign 以及在 PyTorch 中的使用(含代码示例)
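As mentioned in item 6, a minimal usage sketch of torchvision's roi_align on a single feature level (the numbers are made up; MultiScaleRoIAlign adds the per-ROI level selection on top of this):

import torch
from torchvision.ops import roi_align

# one feature level of stride 4: batch 1, 256 channels, 200x272
feature = torch.randn(1, 256, 200, 272)
# two ROIs in original-image coordinates, format (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 10.0, 20.0, 200.0, 150.0],
                     [0, 300.0, 80.0, 500.0, 400.0]])
# spatial_scale maps image coordinates onto this feature level (1/4 for stride 4)
out = roi_align(feature, rois, output_size=(7, 7), spatial_scale=1.0 / 4, sampling_ratio=2)
print(out.shape)  # torch.Size([2, 256, 7, 7])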

3.2.0 Sampling and matching to the GT

Essentially the same as 2.2.4.1.0.
See: Mask/Faster R-Cnn中2次anchors/proposals匹配过程(assign_targets_to_anchors/proposals)(pytorch版本)

proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
"""
proposals: the sampled proposals
matched_idxs: which gt index each proposal matches; note that 0 here is ambiguous (either a negative sample or a genuine index 0), so read it together with labels
labels: the class of each proposal; note!! here 0 means background (a negative sample) and -1 never appears, because background is itself a class, namely class 0
"""

The matching code here differs slightly from the earlier one.

  def assign_targets_to_proposals(self, proposals, gt_boxes, gt_labels):
        matched_idxs = []
        labels = []
        # Note: gt_labels start at 1; 0 is the background
        for proposals_in_image, gt_boxes_in_image, gt_labels_in_image in zip(proposals, gt_boxes, gt_labels):

            if gt_boxes_in_image.numel() == 0:
                # Background image
                device = proposals_in_image.device
                clamped_matched_idxs_in_image = torch.zeros(
                    (proposals_in_image.shape[0],), dtype=torch.int64, device=device
                )
                labels_in_image = torch.zeros((proposals_in_image.shape[0],), dtype=torch.int64, device=device)
            else:
                #  set to self.box_similarity when https://github.com/pytorch/pytorch/issues/27495 lands
                match_quality_matrix = box_ops.box_iou(gt_boxes_in_image, proposals_in_image)
                # Match each proposal to a gt box via IoU;
                # matched_idxs_in_image indexes into gt_boxes.
                # The indices look like [-1, 0, 1, 2, 3], where -1 means no match
                matched_idxs_in_image = self.proposal_matcher(match_quality_matrix)

                # Keep a copy with the -1 indices clamped to 0
                clamped_matched_idxs_in_image = matched_idxs_in_image.clamp(min=0)

                # gt_labels_in_image is a tensor of gt classes, e.g. [1,3,4,4]; clamped_matched_idxs_in_image holds indices, e.g. torch.Size([2004]).
                # Indexing gt_labels_in_image with those 2004 indices gives 2004 class values.
                # In this example class 1 would appear most often, because many -1s were clamped to 0 and therefore point at the first gt entry.
                # So labels_in_image is not yet correct: not every proposal has a proper class, and background proposals currently get class 1.
                labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
                labels_in_image = labels_in_image.to(dtype=torch.int64)


                # Label background (below the low threshold)
                # Find the background indices and reset their (currently wrong) labels_in_image entries to class 0 (background)
                bg_inds = matched_idxs_in_image == self.proposal_matcher.BELOW_LOW_THRESHOLD
                labels_in_image[bg_inds] = 0

                # This branch does not apply here, because both thresholds are 0.5 and nothing falls in between
                # Label ignore proposals (between low and high thresholds)
                ignore_inds = matched_idxs_in_image == self.proposal_matcher.BETWEEN_THRESHOLDS
                labels_in_image[ignore_inds] = -1  # -1 is ignored by sampler

            # In the returned values, labels_in_image correctly marks which class each proposal belongs to (0 is background).
            # matched_idxs is not exact, because it uses clamped_matched_idxs_in_image:
            # many of its zeros mean "no match" rather than gt box 0, so labels_in_image must be consulted when using it.
            matched_idxs.append(clamped_matched_idxs_in_image)
            labels.append(labels_in_image)
        return matched_idxs, labels

3.2.1 ROIAlign / feature extraction

See the code:

from torchvision.ops import MultiScaleRoIAlign
from torchvision.models.detection.faster_rcnn import FastRCNNConvFCHead
# featmap_names: which feature levels are used for the mapping; output_size: the output size; sampling_ratio: how many sample points are used to compute each output value
self.box_roi_pool = MultiScaleRoIAlign(featmap_names=["0", "1", "2", "3"], output_size=7, sampling_ratio=4)
# Feature extraction layers: the input depth matches the feature maps and the input size matches the RoIAlign output; four 3x3 convolutions, all with 256 channels
self.box_head = FastRCNNConvFCHead((256, 7, 7), [256, 256, 256, 256], [256],
                                           norm_layer=torch.nn.BatchNorm2d)
 ......
# Multi-level roiAlign: pick a feature level for each ROI based on its size,
# run RoIAlign, and get a fixed-size tensor with the feature-map depth for every proposal, e.g. Tensor(2000,256,7,7)
box_features = self.box_roi_pool(features, proposals, image_sizes)
# Extract further features, ending in a fully connected layer (e.g. 256 outputs), so every ROI is represented by a 256-long feature vector
box_features = self.box_head(box_features)

3.2.2 Fully connected prediction

Nothing special here: two fully connected layers produce the class scores and a per-class box regression.

class FastRCNNPredictor(nn.Module):
    def __init__(self, representation_size, num_classes):
        super().__init__()
        self.cls_score = nn.Linear(in_features=representation_size, out_features=num_classes)
        self.bbox_pred = nn.Linear(in_features=representation_size, out_features=num_classes * 4)

    def forward(self, x):
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)
        return scores, bbox_deltas

3.2.3 RoIHeads (at this point)

class RoIHeads(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Sampling code, almost the same as the earlier sampling; copied straight from the source
        self.proposal_matcher = Matcher(0.5, 0.5, allow_low_quality_matches=False)
        self.fg_bg_sampler = BalancedPositiveNegativeSampler(512, 0.25)

        # featmap_names: which feature levels are used for the mapping; output_size: the output size; sampling_ratio: how many sample points are used to compute each output value
        self.box_roi_pool = MultiScaleRoIAlign(featmap_names=["0", "1", "2", "3"], output_size=7, sampling_ratio=4)
        # Feature extraction layers: the input depth matches the feature maps and the input size matches the RoIAlign output; four 3x3 convolutions, all with 256 channels
        self.box_head = FastRCNNConvFCHead((256, 7, 7), [256, 256, 256, 256], [256],
                                           norm_layer=torch.nn.BatchNorm2d)
        # 256 is the output depth of box_head; this produces the predictions
        self.box_predictor = FastRCNNPredictor(256, num_classes)



    def forward(self, features, proposals, image_sizes, targets):
        # if self.training:
        # Sampling is done during training
        proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
        # else:
        #     labels = None
        #     regression_targets = None
        #     matched_idxs = None

        # Multi-level roiAlign: pick a feature level for each ROI based on its size,
        # run RoIAlign, and get a fixed-size tensor with the feature-map depth for every proposal, e.g. Tensor(2000,256,7,7)
        box_features = self.box_roi_pool(features, proposals, image_sizes)
        # Extract further features, ending in a fully connected layer (e.g. 256 outputs), so every ROI is represented by a 256-long feature vector
        box_features = self.box_head(box_features)
        # Predict the classification and the per-class regression
        class_logits, box_regression = self.box_predictor(box_features)

    def select_training_samples(self, proposals, targets):
        if targets is None:
            raise ValueError("targets should not be None")
        dtype = proposals[0].dtype
        device = proposals[0].device

        gt_boxes = [t["boxes"].to(dtype) for t in targets]
        gt_labels = [t["labels"] for t in targets]

        # append ground-truth bboxes to propos
        proposals = [torch.cat((proposal, gt_box)) for proposal, gt_box in zip(proposals, gt_boxes)]

        # get matching gt indices for each proposal
        matched_idxs, labels = self.assign_targets_to_proposals(proposals, gt_boxes, gt_labels)
        # sample a fixed proportion of positive-negative proposals
        sampled_inds = self.subsample(labels)
        matched_gt_boxes = []
        num_images = len(proposals)
        for img_id in range(num_images):
            img_sampled_inds = sampled_inds[img_id]
            proposals[img_id] = proposals[img_id][img_sampled_inds]
            labels[img_id] = labels[img_id][img_sampled_inds]
            matched_idxs[img_id] = matched_idxs[img_id][img_sampled_inds]

            gt_boxes_in_image = gt_boxes[img_id]
            if gt_boxes_in_image.numel() == 0:
                gt_boxes_in_image = torch.zeros((1, 4), dtype=dtype, device=device)
            matched_gt_boxes.append(gt_boxes_in_image[matched_idxs[img_id]])

        regression_targets = RPN.encode(matched_gt_boxes, proposals)
        return proposals, matched_idxs, labels, regression_targets

    def subsample(self, labels):
        # type: (List[Tensor]) -> List[Tensor]

        sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
        sampled_inds = []
        for img_idx, (pos_inds_img, neg_inds_img) in enumerate(zip(sampled_pos_inds, sampled_neg_inds)):
            img_sampled_inds = torch.where(pos_inds_img | neg_inds_img)[0]
            sampled_inds.append(img_sampled_inds)
        return sampled_inds

    def assign_targets_to_proposals(self, proposals, gt_boxes, gt_labels):
        matched_idxs = []
        labels = []
        for proposals_in_image, gt_boxes_in_image, gt_labels_in_image in zip(proposals, gt_boxes, gt_labels):

            if gt_boxes_in_image.numel() == 0:
                # Background image
                device = proposals_in_image.device
                clamped_matched_idxs_in_image = torch.zeros(
                    (proposals_in_image.shape[0],), dtype=torch.int64, device=device
                )
                labels_in_image = torch.zeros((proposals_in_image.shape[0],), dtype=torch.int64, device=device)
            else:
                #  set to self.box_similarity when https://github.com/pytorch/pytorch/issues/27495 lands
                match_quality_matrix = box_ops.box_iou(gt_boxes_in_image, proposals_in_image)
                matched_idxs_in_image = self.proposal_matcher(match_quality_matrix)

                clamped_matched_idxs_in_image = matched_idxs_in_image.clamp(min=0)

                labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
                labels_in_image = labels_in_image.to(dtype=torch.int64)

                # Label background (below the low threshold)
                bg_inds = matched_idxs_in_image == self.proposal_matcher.BELOW_LOW_THRESHOLD
                labels_in_image[bg_inds] = 0

                # Label ignore proposals (between low and high thresholds)
                ignore_inds = matched_idxs_in_image == self.proposal_matcher.BETWEEN_THRESHOLDS
                labels_in_image[ignore_inds] = -1  # -1 is ignored by sampler

            matched_idxs.append(clamped_matched_idxs_in_image)
            labels.append(labels_in_image)
        return matched_idxs, labels


class FastRCNNPredictor(nn.Module):
    def __init__(self, representation_size, num_classes):
        super().__init__()
        self.cls_score = nn.Linear(in_features=representation_size, out_features=num_classes)
        self.bbox_pred = nn.Linear(in_features=representation_size, out_features=num_classes * 4)

    def forward(self, x):
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)
        return scores, bbox_deltas

3.3 Computing the loss

This is almost identical to the RPN loss; the difference is the classification term, which here uses multi-class cross entropy rather than the RPN's binary objectness loss.

# compute the classification and box-regression losses (training only)
if self.training:
	loss_classifier, loss_box_reg = self.fastrcnn_loss(class_logits, box_regression, labels, regression_targets)
	losses = {"loss_classifier": loss_classifier, "loss_box_reg": loss_box_reg}
.......

@staticmethod
def fastrcnn_loss(class_logits, box_regression, labels, regression_targets):

    labels = torch.cat(labels, dim=0)
    regression_targets = torch.cat(regression_targets, dim=0)

    classification_loss = nn.functional.cross_entropy(class_logits, labels)

    # get indices that correspond to the regression targets for
    # the corresponding ground truth labels, to be used with
    # advanced indexing
    sampled_pos_inds_subset = torch.where(labels > 0)[0]
    labels_pos = labels[sampled_pos_inds_subset]
    N, num_classes = class_logits.shape
    box_regression = box_regression.reshape(N, box_regression.size(-1) // 4, 4)

    box_loss = nn.functional.smooth_l1_loss(
        box_regression[sampled_pos_inds_subset, labels_pos],
        regression_targets[sampled_pos_inds_subset],
        beta=1 / 9,
        reduction="sum",
    )
    # normalize by the total number of sampled proposals (positives + negatives)
    box_loss = box_loss / labels.numel()

    return classification_loss, box_loss
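
To make the advanced indexing above concrete, here is a small standalone sketch (the shapes are hypothetical: 512 sampled proposals, 5 classes) showing how each positive proposal's 4 regression values are picked out for its ground-truth class:

import torch

N, num_classes = 512, 5
box_regression = torch.randn(N, num_classes * 4)       # raw head output: 4 deltas per class
labels = torch.randint(0, num_classes, (N,))           # sampled labels, 0 = background

sampled_pos_inds_subset = torch.where(labels > 0)[0]   # only foreground proposals get a box loss
labels_pos = labels[sampled_pos_inds_subset]
box_regression = box_regression.reshape(N, num_classes, 4)
# for every positive proposal, select the 4 deltas predicted for its own class
selected = box_regression[sampled_pos_inds_subset, labels_pos]
print(selected.shape)                                  # torch.Size([num_pos, 4])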

3.4 Mask forward pass and loss computation

The mask forward pass and loss computation live in a separate code block attached to the RoIHeads of the FasterRCNN.
The computation target differs between the training and inference stages (see the condensed sketch after this list):

  1. During training only the loss is needed, so the forward pass only has to run up to mask_logits.
  2. During inference no loss is computed, but the predictions are needed (probabilities rather than raw mask_logits).
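
The branch structure looks roughly as follows (a condensed sketch based on torchvision's RoIHeads.forward; result here is the per-image list of detection dicts produced by the box branch):

if self.training:
    gt_masks = [t["masks"] for t in targets]
    gt_labels = [t["labels"] for t in targets]
    # training: only the loss is computed from mask_logits
    rcnn_loss_mask = maskrcnn_loss(mask_logits, mask_proposals, gt_masks, gt_labels, pos_matched_idxs)
    loss_mask = {"loss_mask": rcnn_loss_mask}
else:
    # inference: convert the logits to per-class probabilities and attach them to the detections
    labels = [r["labels"] for r in result]
    masks_probs = maskrcnn_inference(mask_logits, labels)
    for mask_prob, r in zip(masks_probs, result):
        r["masks"] = mask_prob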
3.4.1 Call structure

To produce mask predictions, the calling code mirrors the way roi_heads computes boxes and classifications:
mask_features = self.mask_roi_pool(features, mask_proposals, image_shapes)
mask_features = self.mask_head(mask_features)
mask_logits = self.mask_predictor(mask_features)
  1. Get mask_proposals, a subset of the proposals (those that contain an object).
  2. Run roiAlign over mask_proposals and features to get mask_features. The output size here is usually larger than the one used for boxes; a larger output can be chosen if finer masks are needed. Tensor(49,256,14,14)
  3. Pass mask_features into mask_head for further feature extraction. Tensor(49,256,14,14)
  4. Pass the extracted mask_features into mask_predictor to get mask_logits, which should look like Tensor(49,5,28,28). The 5 is the number of classes, i.e. one mask per class; each class mask is a single channel of logits (they become probabilities only after sigmoid). The jump from 14 to 28 comes from the ConvTranspose2d (transposed convolution).
  5. Overall this is not very different from box prediction; the structure is as follows (a shape walk-through follows the printout):
 (mask_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(14, 14), sampling_ratio=2)
  (mask_head): MaskRCNNHeads(
    (0): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (1): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (2): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (3): Conv2dNormActivation(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
  )
  (mask_predictor): MaskRCNNPredictor(
    (conv5_mask): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
    (relu): ReLU(inplace=True)
    (mask_fcn_logits): Conv2d(256, 5, kernel_size=(1, 1), stride=(1, 1))
  )
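
As a shape walk-through of the structure above (a minimal sketch; 49 proposals and 5 classes are just the example numbers used in this section, and the two classes are assumed to be importable from torchvision.models.detection.mask_rcnn):

import torch
from torchvision.models.detection.mask_rcnn import MaskRCNNHeads, MaskRCNNPredictor

mask_head = MaskRCNNHeads(256, [256, 256, 256, 256], 1)   # four 3x3 conv blocks, 14x14 stays 14x14
mask_predictor = MaskRCNNPredictor(256, 256, 5)           # ConvTranspose2d doubles 14x14 -> 28x28

mask_features = torch.randn(49, 256, 14, 14)              # output of mask_roi_pool (RoIAlign)
x = mask_head(mask_features)                              # Tensor(49, 256, 14, 14)
mask_logits = mask_predictor(x)                           # Tensor(49, 5, 28, 28)
print(x.shape, mask_logits.shape)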
3.4.1.1 Getting mask_proposals

Whether for prediction or for loss computation, a set of proposals is needed, namely the proposals that contain an object.
All object-containing proposals are taken out as mask_proposals.

  1. For the inference stage, all proposals are used (see the sketch after the training-stage code below).
  2. For the training stage, the proposals containing an object are first filtered out via labels to form mask_proposals, together with what the loss computation needs: the corresponding classes (labels) and indices (pos_matched_idxs).

labels is computed during sampling in the training stage: 0 means background, 1 and above are object classes.

if self.training:
     num_images = len(proposals)
     mask_proposals = []
     pos_matched_idxs = []
     for img_id in range(num_images):
         pos = torch.where(labels[img_id] > 0)[0]
         mask_proposals.append(proposals[img_id][pos])
         pos_matched_idxs.append(matched_idxs[img_id][pos])
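
For reference, the inference-side counterpart in torchvision's RoIHeads simply takes the boxes kept by the box branch as mask proposals and needs no matched indices (a sketch of the corresponding else branch, not a standalone snippet):

else:
    mask_proposals = [p["boxes"] for p in result]   # result: per-image detections from the box branch
    pos_matched_idxs = None                         # only needed for the training loss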
3.4.1.2 MaskRCNNHeads/MaskRCNNPredictor
  1. MaskRCNNHeads works just like the earlier heads (it only extracts richer features), so it is not discussed again. Two points are worth noting:
  2. no fully connected layers are needed here, hence mask_head = MaskRCNNHeads(256, [256, 256, 256, 256], 1, norm_layer=nn.BatchNorm2d).
  3. Unlike box_head, MaskRCNNPredictor produces its result with one transposed convolution followed by one convolution. The transposed convolution is there to raise the output resolution.
    For example, a 14x14 feature map coming out of mask_head is upsampled to a 28x28 result. That 28x28 result is then resized to the original image size, thresholded and binarized to produce the final mask (see the sketch after the code below). At 14x14 the resolution would be too low and the mask edges would deviate noticeably.
class MaskRCNNPredictor(nn.Sequential):
    def __init__(self, in_channels, dim_reduced, num_classes):
        super().__init__(
            OrderedDict(
                [
                    ("conv5_mask", nn.ConvTranspose2d(in_channels, dim_reduced, 2, 2, 0)),
                    ("relu", nn.ReLU(inplace=True)),
                    ("mask_fcn_logits", nn.Conv2d(dim_reduced, num_classes, 1, 1, 0)),
                ]
            )
        )

        for name, param in self.named_parameters():
            if "weight" in name:
                nn.init.kaiming_normal_(param, mode="fan_out", nonlinearity="relu")
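
The resize-and-binarize step mentioned above happens later, during post-processing. A simplified, self-contained sketch of the idea (not torchvision's exact paste_masks_in_image; the box and threshold are illustrative values):

import torch
import torch.nn.functional as F

def paste_mask_sketch(mask_prob, box, img_h, img_w, thresh=0.5):
    # mask_prob: (28, 28) probabilities for one detection; box: (4,) xyxy in image coordinates
    x1, y1, x2, y2 = [int(round(v)) for v in box.tolist()]
    x1 = min(max(x1, 0), img_w - 1)
    y1 = min(max(y1, 0), img_h - 1)
    x2 = min(max(x2, x1 + 1), img_w)
    y2 = min(max(y2, y1 + 1), img_h)
    w, h = x2 - x1, y2 - y1
    # stretch the ROI-scale mask to the box size with bilinear interpolation
    resized = F.interpolate(mask_prob[None, None], size=(h, w), mode="bilinear", align_corners=False)[0, 0]
    full = torch.zeros(img_h, img_w)
    full[y1:y1 + h, x1:x1 + w] = resized
    return full > thresh    # binarize with a fixed threshold

mask = paste_mask_sketch(torch.rand(28, 28), torch.tensor([40.0, 60.0, 140.0, 210.0]), 480, 640)
print(mask.shape, mask.dtype)    # torch.Size([480, 640]) torch.bool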
3.4.2 Training stage: loss computation (maskrcnn_loss)
  1. Locate each proposal's class and use mask_matched_idxs to pull out the 28x28 mask in mask_logits belonging to that class.
  2. Take the ground-truth gt_masks (full image size) and project them onto the 28x28 scale via roi_align.
  3. Compute the binary cross entropy between the two.
def maskrcnn_loss(mask_logits, proposals, gt_masks, gt_labels, mask_matched_idxs):
    # type: (Tensor, List[Tensor], List[Tensor], List[Tensor], List[Tensor]) -> Tensor
    """
    Args:
        proposals (list[BoxList])
        mask_logits (Tensor) 预测结果,(49,5,28,28),每个分类下的结果
        mask_matched_idxs proposal和gt的映射关系
        gt_masks 真实的mask信息,注意是全尺寸的,和预测的28x28尺度是不一致的
        gt_labels 真实分类

    Return:
        mask_loss (Tensor): scalar tensor containing the loss
    """

    discretization_size = mask_logits.shape[-1]
    labels = [gt_label[idxs] for gt_label, idxs in zip(gt_labels, mask_matched_idxs)]
    # project the ground-truth masks onto the 28x28 scale, again using roi_align
    mask_targets = [
        project_masks_on_boxes(m, p, i, discretization_size) for m, p, i in zip(gt_masks, proposals, mask_matched_idxs)
    ]

    labels = torch.cat(labels, dim=0)
    mask_targets = torch.cat(mask_targets, dim=0)

    # torch.mean (in binary_cross_entropy_with_logits) doesn't
    # accept empty tensors, so handle it separately
    if mask_targets.numel() == 0:
        return mask_logits.sum() * 0
    # pick the mask in mask_logits corresponding to each label and take the
    # binary cross entropy against mask_targets as the loss
    mask_loss = F.binary_cross_entropy_with_logits(
        mask_logits[torch.arange(labels.shape[0], device=labels.device), labels], mask_targets
    )
    return mask_loss
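
For completeness, project_masks_on_boxes used above looks roughly like the following in torchvision: it builds (index, box) rois and runs roi_align over the full-size ground-truth masks to crop and resize them down to MxM (28x28 here):

import torch
from torchvision.ops import roi_align

def project_masks_on_boxes(gt_masks, boxes, matched_idxs, M):
    # type: (torch.Tensor, torch.Tensor, torch.Tensor, int) -> torch.Tensor
    matched_idxs = matched_idxs.to(boxes)
    # rois are (index_into_gt_masks, x1, y1, x2, y2)
    rois = torch.cat([matched_idxs[:, None], boxes], dim=1)
    gt_masks = gt_masks[:, None].to(rois)
    # crop each matched gt mask to its proposal box and resize it to MxM
    return roi_align(gt_masks, rois, (M, M), 1.0)[:, 0]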
3.4.3 Inference stage: predictions (maskrcnn_inference)

1. From the network's output, take the mask corresponding to the predicted (highest-probability) class for post-processing.
2. Return probabilities; each pixel is a binary classification, so sigmoid is applied to the logits.

def maskrcnn_inference(x, labels):
    # type: (Tensor, List[Tensor]) -> List[Tensor]
    """
    From the results of the CNN, post process the masks
    by taking the mask corresponding to the class with max
    probability (which are of fixed size and directly output
    by the CNN) and return the masks in the mask field of the BoxList.

    Args:
        x (Tensor): the mask logits
        labels (list[BoxList]): bounding boxes that are used as
            reference, one for each image

    Returns:
        results (list[BoxList]): one BoxList for each image, containing
            the extra field mask
    """
    mask_prob = x.sigmoid()

    # select masks corresponding to the predicted classes
    num_masks = x.shape[0]
    boxes_per_image = [label.shape[0] for label in labels]
    labels = torch.cat(labels)
    index = torch.arange(num_masks, device=labels.device)
    mask_prob = mask_prob[index, labels][:, None]
    mask_prob = mask_prob.split(boxes_per_image, dim=0)

    return mask_prob
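
A quick shape check of maskrcnn_inference (hypothetical numbers: 100 detections split across two images, 5 classes):

import torch

mask_logits = torch.randn(100, 5, 28, 28)
labels = [torch.randint(1, 5, (60,)), torch.randint(1, 5, (40,))]   # predicted class per detection
mask_probs = maskrcnn_inference(mask_logits, labels)
print([m.shape for m in mask_probs])   # [torch.Size([60, 1, 28, 28]), torch.Size([40, 1, 28, 28])]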

3.5 Returned results

detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
"""
detections is only returned outside training, i.e. the predictions, shaped like:
boxes: torch.Size([100, 4])            predicted boxes (decoded from the regression offsets)
labels: torch.Size([100])              predicted labels (highest-probability class)
scores: torch.Size([100])              scores of the predicted labels
masks: torch.Size([100, 1, 28, 28])    predicted masks (of the predicted class, still at ROI scale)

detector_losses is only returned during training:
loss_classifier: torch.Size([])        classification loss
loss_box_reg: torch.Size([])           box regression loss
loss_mask: torch.Size([])              mask loss

"""