Debugging the YOLOv8 Segment head

unicorn832

Last modified 2024-08-23 16:41:18

Tags: YOLO

First published 2024-08-23 12:16:36

Article link: https://blog.csdn.net/weixin_70267235/article/details/141454453

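YOLOv8's segmentation head follows the YOLACT approach (this post is a quick way to get the idea: https://zhuanlan.zhihu.com/p/376347955).

Before reading this part of the source, I assumed instance segmentation was simply object detection plus semantic segmentation (that is roughly how I understood Mask R-CNN), but stepping through the code showed that this is not how it works here. First, the source:

Source code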

class Segment(Detect):
    """YOLOv8 Segment head for segmentation models."""

    def __init__(self, nc=80, nm=32, npr=256, ch=()):
        """Initialize the YOLO model attributes such as the number of masks, prototypes, and the convolution layers."""
        super().__init__(nc, ch)
        self.nm = nm  # number of masks
        self.npr = npr  # number of protos
        self.proto = Proto(ch[0], self.npr, self.nm)  # protos
        self.detect = Detect.forward

        c4 = max(ch[0] // 4, self.nm)
        self.cv4 = nn.ModuleList(nn.Sequential(Conv(x, c4, 3), Conv(c4, c4, 3), nn.Conv2d(c4, self.nm, 1)) for x in ch)

    def forward(self, x):
        """Return model outputs and mask coefficients if training, otherwise return outputs and mask coefficients."""
        p = self.proto(x[0])  # mask protos
        bs = p.shape[0]  # batch size

        mc = torch.cat([self.cv4[i](x[i]).view(bs, self.nm, -1) for i in range(self.nl)], 2)  # mask coefficients
        x = self.detect(self, x)
        if self.training:
            return x, mc, p
        return (torch.cat([x, mc], 1), p) if self.export else (torch.cat([x[0], mc], 1), (x[1], mc, p))

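The __init__ method

Segment inherits from Detect and adds a few attributes: nm (the number of masks, 32 by default), npr (256 by default, passed to Proto as its hidden channel width), proto, and the cv4 conv branches.

proto generates the mask prototypes; we will look at it more closely in forward.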

The forward method

The `proto` layer takes the input feature map `x[0]` and produces the prototype masks `p`

p = self.proto(x[0])  # mask protos

Proto:

class Proto(nn.Module):
    """YOLOv8 mask Proto module for segmentation models."""

    def __init__(self, c1, c_=256, c2=32):
        """
        Initializes the YOLOv8 mask Proto module with specified number of protos and masks.

        Input arguments are ch_in, number of protos, number of masks.
        """
        super().__init__()
        self.cv1 = Conv(c1, c_, k=3)
        self.upsample = nn.ConvTranspose2d(c_, c_, 2, 2, 0, bias=True)  # nn.Upsample(scale_factor=2, mode='nearest')
        self.cv2 = Conv(c_, c_, k=3)
        self.cv3 = Conv(c_, c2)

    def forward(self, x):
        """Performs a forward pass through layers using an upsampled input image."""
        return self.cv3(self.cv2(self.upsample(self.cv1(x))))
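
To see where the prototype resolution comes from, here is a minimal shape check. It is only a sketch: `TinyProto` is a hypothetical stand-in that keeps Proto's layer layout but swaps the ultralytics `Conv` block for a plain `nn.Conv2d`, and the 64-channel 80x64 feature map is an assumed stride-8 input for a 640x512 image.

```python
import torch
import torch.nn as nn


class TinyProto(nn.Module):
    """Hypothetical stand-in for Proto: same layer layout, plain convs instead of Conv."""

    def __init__(self, c1, c_=256, c2=32):
        super().__init__()
        self.cv1 = nn.Conv2d(c1, c_, 3, padding=1)
        self.upsample = nn.ConvTranspose2d(c_, c_, 2, 2, 0, bias=True)  # doubles H and W
        self.cv2 = nn.Conv2d(c_, c_, 3, padding=1)
        self.cv3 = nn.Conv2d(c_, c2, 1)

    def forward(self, x):
        return self.cv3(self.cv2(self.upsample(self.cv1(x))))


# Assumed input: stride-8 feature map of a 640x512 image, 64 channels.
feat = torch.randn(1, 64, 80, 64)
with torch.no_grad():
    protos = TinyProto(64)(feat)
print(protos.shape)  # torch.Size([1, 32, 160, 128])
```

The transposed convolution with kernel 2 and stride 2 is what turns the 80x64 map into 160x128, and the final 1x1 conv reduces the channels to the 32 prototypes.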

This produces 32 prototypes of size 160*128 for my input. The following script, pasted in while debugging, visualizes them:

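import os

import cv2
import numpy as np

# Paste inside Proto.forward while debugging; `self` and `x` come from that scope.
out_dir = r'D:\360MoveData\Users\UNICORN\Desktop\ultralytics-main\ultralytics\nn\modules\vis_protos'
os.makedirs(out_dir, exist_ok=True)
vis_proto = self.cv3(self.cv2(self.upsample(self.cv1(x)))).detach().cpu()
for i in range(vis_proto.shape[1]):
    channel_data = vis_proto[0, i].float().numpy()
    # Raw activations are unbounded floats, so rescale each channel to 0-255 before saving.
    channel_data = cv2.normalize(channel_data, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    cv2.imwrite(os.path.join(out_dir, f'{i}.jpg'), channel_data)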

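The result is 32 images that are hard to make sense of on their own... (😓)

These 32 prototypes are returned as output1; they are presumably combined linearly with the mask coefficients later on to form the final masks.

Compute the mask coefficients `mc`

Each feature map goes through its own cv4 conv branch, and the results are flattened and concatenated along the last dimension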

mc = torch.cat([self.cv4[i](x[i]).view(bs, self.nm, -1) for i in range(self.nl)], 2)  # mask coefficients
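
The shapes in this step can be checked with a small sketch. The channel counts `(64, 128, 256)` and the feature map sizes (strides 8/16/32 on a 640x512 input) are assumptions for illustration, and the branches use plain `nn.Conv2d` layers rather than the ultralytics `Conv` block.

```python
import torch
import torch.nn as nn

nm = 32                                  # number of mask coefficients per anchor
ch = (64, 128, 256)                      # assumed channels of the three feature maps
sizes = [(80, 64), (40, 32), (20, 16)]   # assumed strides 8/16/32 on a 640x512 input

c4 = max(ch[0] // 4, nm)                 # same rule as in Segment.__init__
# Hypothetical simplified branches: plain convs instead of ultralytics' Conv blocks.
cv4 = nn.ModuleList(
    nn.Sequential(nn.Conv2d(c, c4, 3, padding=1), nn.Conv2d(c4, c4, 3, padding=1), nn.Conv2d(c4, nm, 1))
    for c in ch
)

x = [torch.randn(1, c, h, w) for c, (h, w) in zip(ch, sizes)]
bs = x[0].shape[0]
with torch.no_grad():
    mc = torch.cat([cv4[i](x[i]).view(bs, nm, -1) for i in range(len(ch))], 2)

print(c4)        # 32
print(mc.shape)  # torch.Size([1, 32, 6720]) since 80*64 + 40*32 + 20*16 = 6720
```

So every anchor position on every feature level gets its own vector of 32 coefficients.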

Call Detect's `forward` method to get the detection results `x`

x = self.detect(self, x)

In training mode, return the detection results `x`, the mask coefficients `mc`, and the prototype masks `p`

if self.training:
    return x, mc, p

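Otherwise (inference or evaluation), the output format depends on `self.export`:

If export is True, return the detections concatenated with the mask coefficients, plus the prototype masks.
If export is False, return the detections concatenated with the mask coefficients, plus a tuple holding the raw per-level outputs x[1], the mask coefficients, and the prototype masks.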

return (torch.cat([x, mc], 1), p) if self.export else (torch.cat([x[0], mc], 1), (x[1], mc, p))

I am running inference without export, so the return value is torch.cat([x[0], mc], 1), (x[1], mc, p)

That completes the forward pass.
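
As a rough shape check of that concatenation, the sketch below uses random stand-in tensors. The 80 classes (the default `nc`) and the 6720 anchors (a 640x512 input) are assumptions, not values read from the debugger.

```python
import torch

nc, nm, n_anchors = 80, 32, 6720         # assumed: default nc, 32 coefficients, 640x512 input
det = torch.randn(1, 4 + nc, n_anchors)  # stand-in for x[0]: decoded boxes + class scores
mc = torch.randn(1, nm, n_anchors)       # stand-in for the mask coefficients

out = torch.cat([det, mc], 1)
print(out.shape)  # torch.Size([1, 116, 6720]): 4 box values + 80 class scores + 32 coefficients
```

The coefficients ride along with each candidate box, so whatever survives NMS still carries its own 32 values.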

Post-processing

Stepping through the code, we arrive at post-processing:

# Postprocess
with profilers[2]:
    self.results = self.postprocess(preds, im, im0s)

def postprocess(self, preds, img, orig_imgs):
    """Applies non-max suppression and processes detections for each image in an input batch."""
    p = ops.non_max_suppression(preds[0],
                                self.args.conf,
                                self.args.iou,
                                agnostic=self.args.agnostic_nms,
                                max_det=self.args.max_det,
                                nc=len(self.model.names),
                                classes=self.args.classes)

    if not isinstance(orig_imgs, list):  # input images are a torch.Tensor, not a list
        orig_imgs = ops.convert_torch2numpy_batch(orig_imgs)

    results = []
    proto = preds[1][-1] if len(preds[1]) == 3 else preds[1]  # second output is len 3 if pt, but only 1 if exported
    for i, pred in enumerate(p):
        orig_img = orig_imgs[i]
        img_path = self.batch[0][i]
        if not len(pred):  # save empty boxes
            masks = None
        elif self.args.retina_masks:
            pred[:, :4] = ops.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
            masks = ops.process_mask_native(proto[i], pred[:, 6:], pred[:, :4], orig_img.shape[:2])  # HWC
        else:
            masks = ops.process_mask(proto[i], pred[:, 6:], pred[:, :4], img.shape[2:], upsample=True)  # HWC
            pred[:, :4] = ops.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
        results.append(Results(orig_img, path=img_path, names=self.model.names, boxes=pred[:, :6], masks=masks))
    return results

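Non-maximum suppression

I will leave this for a separate debugging post.

After NMS, only 43 instances are left:

Each row has 38 values: 4 for the bbox, 1 for conf, 1 for class, and 32 mask coefficients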
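
Going by how postprocess slices the tensor (`pred[:, :6]` for the boxes and `pred[:, 6:]` for the masks), each row can be split as in this sketch. The tensor is random stand-in data with the 43x38 shape seen in the debugger.

```python
import torch

pred = torch.rand(43, 38)  # stand-in for one image's NMS output (43 instances)

boxes = pred[:, :4]    # x1, y1, x2, y2
conf = pred[:, 4]      # confidence score
cls = pred[:, 5]       # class index
coeffs = pred[:, 6:]   # 32 mask coefficients

print(boxes.shape, conf.shape, cls.shape, coeffs.shape)
```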

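Handling the original images

Convert them to NumPy format

Take a look at the original image

Processing each image's detections

Loop over each image and its predictions.
How the masks are processed depends on whether retina masks (high-resolution masks) are enabled.
- If self.args.retina_masks is True:
  - Scale the predicted boxes from the network input size to the original image size.
  - Use ops.process_mask_native to process the masks, producing masks at the original image size.
- Otherwise:
  - Use ops.process_mask to process the masks, then scale the predicted boxes from the network input size to the original image size.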
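
The box scaling in both branches can be sketched roughly as below. `undo_letterbox` is a hypothetical helper written for this post, not the body of `ops.scale_boxes`; it assumes the image was resized with a single gain and padded symmetrically.

```python
import torch


def undo_letterbox(boxes, input_shape, orig_shape):
    """Map xyxy boxes from the letterboxed network input back to the original image.

    Hypothetical helper; assumes symmetric padding and a single resize gain.
    Shapes are (h, w).
    """
    gain = min(input_shape[0] / orig_shape[0], input_shape[1] / orig_shape[1])
    pad_x = (input_shape[1] - orig_shape[1] * gain) / 2
    pad_y = (input_shape[0] - orig_shape[0] * gain) / 2

    out = boxes.clone()
    out[:, [0, 2]] = (out[:, [0, 2]] - pad_x) / gain
    out[:, [1, 3]] = (out[:, [1, 3]] - pad_y) / gain
    # Clip to the original image bounds.
    out[:, [0, 2]] = out[:, [0, 2]].clamp(0, orig_shape[1])
    out[:, [1, 3]] = out[:, [1, 3]].clamp(0, orig_shape[0])
    return out


# Assumed example: a 1280x1024 (h x w) image resized by 0.5 into a 640x512 input, no padding.
boxes = torch.tensor([[100.0, 200.0, 300.0, 400.0]])
print(undo_letterbox(boxes, (640, 512), (1280, 1024)))  # tensor([[200., 400., 600., 800.]])
```

Note the order in the non-retina branch: the masks are built while the boxes are still in network-input coordinates, and only afterwards are the boxes mapped back to the original image.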

elif self.args.retina_masks:
    pred[:, :4] = ops.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
    masks = ops.process_mask_native(proto[i], pred[:, 6:], pred[:, :4], orig_img.shape[:2])  # HWC
else:
    masks = ops.process_mask(proto[i], pred[:, 6:], pred[:, :4], img.shape[2:], upsample=True)  # HWC
    pred[:, :4] = ops.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)

I am not using high-resolution masks, so execution goes straight into ops.process_mask:

ops.process_mask

def process_mask(protos, masks_in, bboxes, shape, upsample=False):
    """
    Apply masks to bounding boxes using the output of the mask head.

    Args:
        protos (torch.Tensor): A tensor of shape [mask_dim, mask_h, mask_w].
        masks_in (torch.Tensor): A tensor of shape [n, mask_dim], where n is the number of masks after NMS.
        bboxes (torch.Tensor): A tensor of shape [n, 4], where n is the number of masks after NMS.
        shape (tuple): A tuple of integers representing the size of the input image in the format (h, w).
        upsample (bool): A flag to indicate whether to upsample the mask to the original image size. Default is False.

    Returns:
        (torch.Tensor): A binary mask tensor of shape [n, h, w], where n is the number of masks after NMS, and h and w
            are the height and width of the input image. The mask is applied to the bounding boxes.
    """

    c, mh, mw = protos.shape  # CHW
    ih, iw = shape
    masks = (masks_in @ protos.float().view(c, -1)).sigmoid().view(-1, mh, mw)  # CHW

    downsampled_bboxes = bboxes.clone()
    downsampled_bboxes[:, 0] *= mw / iw
    downsampled_bboxes[:, 2] *= mw / iw
    downsampled_bboxes[:, 3] *= mh / ih
    downsampled_bboxes[:, 1] *= mh / ih

    masks = crop_mask(masks, downsampled_bboxes)  # CHW
    if upsample:
        masks = F.interpolate(masks[None], shape, mode='bilinear', align_corners=False)[0]  # CHW
    return masks.gt_(0.5)

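Computing the masks

Unpack the dimensions c, mh, mw from the protos tensor.
Matrix-multiply masks_in ([n, c]) with the flattened protos ([c, mh*mw]), reshape the result to [n, mh, mw], and apply sigmoid to squash the values into the range 0 to 1. This produces the mask tensor.

The 32 here is the number of prototypes generated earlier; each prototype is weighted by its coefficient from mc and the weighted prototypes are summed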
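
To make the linear combination concrete, this sketch repeats the matrix product on random stand-in tensors (sizes taken from my debugging session) and compares one instance against an explicit weighted sum of the prototypes.

```python
import torch

n, c, mh, mw = 43, 32, 160, 128   # sizes from the debugging session
protos = torch.randn(c, mh, mw)   # stand-in prototypes
masks_in = torch.randn(n, c)      # stand-in mask coefficients

# (n, 32) @ (32, mh*mw) -> (n, mh*mw): each row is a weighted sum of the 32 prototypes.
masks = (masks_in @ protos.view(c, -1)).sigmoid().view(-1, mh, mw)
print(masks.shape)  # torch.Size([43, 160, 128])

# The same computation written out explicitly for instance 0.
manual = sum(masks_in[0, k] * protos[k] for k in range(c)).sigmoid()
print(torch.allclose(masks[0], manual, atol=1e-4))
```

This is the key difference from what I expected at the start: the network never predicts a mask per instance directly, only 32 shared prototypes plus 32 numbers per instance.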

Now let's visualize masks with the script below:

import os

import cv2
import numpy as np

# Paste inside process_mask while debugging; `masks` comes from that scope.
out_dir = r'D:\360MoveData\Users\UNICORN\Desktop\ultralytics-main\ultralytics\utils\vis_masks'
os.makedirs(out_dir, exist_ok=True)
vis_masks = masks.detach().cpu().numpy()
for i in range(vis_masks.shape[0]):
    # Threshold into a fresh uint8 array so the original `masks` tensor is not modified.
    channel_data = (vis_masks[i] > 0.5).astype(np.uint8) * 255
    cv2.imwrite(os.path.join(out_dir, f'{i}.jpg'), channel_data)

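I won't show them all. They are starting to look right, but something still seems to be missing, so let's keep stepping.

Rescaling the boxes

Make a copy of the boxes, downsampled_bboxes.
Scale the box coordinates by the ratio between the prototype mask size and the input image size.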

downsampled_bboxes = bboxes.clone()
downsampled_bboxes[:, 0] *= mw / iw
downsampled_bboxes[:, 2] *= mw / iw
downsampled_bboxes[:, 3] *= mh / ih
downsampled_bboxes[:, 1] *= mh / ih

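Cropping the masks

crop_mask zeroes out every mask value that lies outside the instance's box; the mask tensor keeps its shape. This keeps each mask confined to its own detection box.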

masks = crop_mask(masks, downsampled_bboxes)
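
The cropping idea can be sketched with broadcasting. `crop_mask_sketch` is my own illustrative version and may differ from the library's implementation in detail; it assumes xyxy boxes already expressed in mask-pixel coordinates.

```python
import torch


def crop_mask_sketch(masks, boxes):
    """Zero out every pixel that falls outside its instance's xyxy box.

    masks: (n, h, w); boxes: (n, 4) in mask-pixel coordinates.
    The output keeps the shape (n, h, w).
    """
    _, h, w = masks.shape
    x1, y1, x2, y2 = torch.chunk(boxes[:, :, None], 4, 1)       # each (n, 1, 1)
    cols = torch.arange(w, device=masks.device)[None, None, :]  # (1, 1, w)
    rows = torch.arange(h, device=masks.device)[None, :, None]  # (1, h, 1)
    inside = (cols >= x1) & (cols < x2) & (rows >= y1) & (rows < y2)  # (n, h, w)
    return masks * inside


masks = torch.ones(1, 4, 6)
boxes = torch.tensor([[1.0, 1.0, 4.0, 3.0]])  # x1, y1, x2, y2
cropped = crop_mask_sketch(masks, boxes)
print(cropped[0])  # ones only in rows 1-2, columns 1-3
```

Without this step, a prototype combination that also lights up on a similar object elsewhere in the image would leak into the instance's mask.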

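Upsampling the masks

If upsample is True, bilinear interpolation resizes the mask tensor to shape, which here is the network input size img.shape[2:].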

if upsample:
    masks = F.interpolate(masks[None], shape, mode='bilinear', align_corners=False)[0]  # CHW

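Visualize again in the same way:

Producing the binary masks

Values greater than 0.5 are set to 1 and everything else to 0 (gt_ works in place), which gives the binary masks.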
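
The last two steps of process_mask can be tried in isolation. The sketch below uses three random stand-in masks and an assumed network input size of 640x512.

```python
import torch
import torch.nn.functional as F

masks = torch.rand(3, 160, 128)   # stand-in for cropped, sigmoid-activated masks
shape = (640, 512)                # assumed network input size (h, w)

# interpolate expects (N, C, H, W), so add a batch dim, resize, then drop it again.
up = F.interpolate(masks[None], shape, mode='bilinear', align_corners=False)[0]
binary = up.gt_(0.5)              # in-place comparison; the result lives in `up`

print(binary.shape)   # torch.Size([3, 640, 512])
print(binary.dtype)   # torch.float32
print(binary.unique())
```

Because `gt_` is the in-place variant, the result stays a float tensor holding 0s and 1s rather than becoming a bool tensor.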

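Building the Results objects:

If nothing was detected, masks is set to None.
In either case, the processed boxes and masks are wrapped in a Results instance.
Results holds the original image, the image path, the class names, the boxes, and the masks.

Loss computation

Since we have come this far, a quick word on the loss.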

loss = F.binary_cross_entropy_with_logits(pred_mask, gt_mask, reduction='none')

This computes a per-pixel BCE. Visualize pred_mask and gt_mask
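
For a feel of what that call returns, here is a small sketch on random stand-in tensors. It assumes pred_mask holds raw logits (no sigmoid applied yet), which is what the `_with_logits` variant expects. How the per-pixel values are reduced afterwards is not covered in this post, so the mean at the end is only an illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred_mask = torch.randn(2, 160, 128)               # stand-in logits (no sigmoid applied)
gt_mask = (torch.rand(2, 160, 128) > 0.5).float()  # stand-in binary targets

loss = F.binary_cross_entropy_with_logits(pred_mask, gt_mask, reduction='none')
print(loss.shape)  # torch.Size([2, 160, 128]): one loss value per pixel

# Equivalent manual form: apply sigmoid first, then ordinary BCE.
p = pred_mask.sigmoid()
manual = -(gt_mask * p.log() + (1 - gt_mask) * (1 - p).log())
print(torch.allclose(loss, manual, atol=1e-4))

# reduction='none' leaves the reduction to the caller; here, a plain mean per instance.
print(loss.mean(dim=(1, 2)))
```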