[Paper Notes] Fine-Grained Boundary Segmentation with PointRend: Image Segmentation as Rendering

Article link: https://blog.csdn.net/muyijames/article/details/115585147

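This post takes a close look at the PointRend module: how it improves segmentation accuracy, especially along object boundaries, by selecting the most uncertain points and re-predicting them. It explains the point selection strategy, point feature extraction, and point classification, and then walks through the source code.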


Contents

1 Overview
2 PointRend Module Analysis
3 Source Code Walkthrough
4 Experimental Results
4 References

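1 Overview

Today's paper is "PointRend: Image Segmentation as Rendering" (2020), co-authored by Kaiming He. It mainly addresses the problem that object boundaries are not segmented finely enough in image segmentation.

The pixels a model is most likely to misclassify lie almost entirely on object boundaries, and the boundary makes up only a very small fraction of the whole object. Building on this observation, the authors propose to pick only the top N most uncertain point locations in the predicted mask and predict them individually, while the remaining pixels are filled in by plain interpolation. This addresses the accuracy problem while keeping memory and computation as low as possible.

Paper: "PointRend: Image Segmentation as Rendering"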

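2 PointRend Module Analysis

The PointRend module consists of three main components:

1. Point Selection Strategy: selects a small number of real-valued points to make predictions on, which avoids excessive computation for all pixels in the high-resolution output grid;

2. Point-wise Feature Representation: for each selected real-valued point, the feature is computed by bilinear interpolation of the feature map f, using the point's 4 nearest neighbors on the regular grid of f. As a result, the method can exploit sub-pixel information encoded in the channel dimension of f to predict a segmentation with higher resolution than f itself;

3. Point Head: a small neural network that predicts a label from the point-wise feature representation, independently for each point. The feature of each subdivision point is obtained by bilinear interpolation, and the classifier at each location is a simple MLP. This is equivalent to predicting with a 1×1 conv, except that nothing is computed for interior points whose prediction is already confident.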
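The bilinear interpolation in component 2 can be sketched in plain Python as follows. This is only an illustration, not the paper's implementation: the function name `bilinear_point_feature`, the nested-list layout `[channel][row][col]`, and the convention that integer coordinates fall exactly on grid locations are all assumptions made for this example.

```python
def bilinear_point_feature(feature_map, x, y):
    """Interpolate a C-channel feature vector at a real-valued point (x, y).

    feature_map: nested list indexed as [channel][row][col], at least 2 x 2.
    x, y: continuous coordinates in pixel-index space (assumed convention:
          integer values fall exactly on grid locations).
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Clamp the point into the grid so that four neighbours always exist.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0 = min(int(x), width - 2)
    y0 = min(int(y), height - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0  # fractional offsets inside the cell

    feature = []
    for channel in feature_map:
        top = (1 - wx) * channel[y0][x0] + wx * channel[y0][x1]
        bottom = (1 - wx) * channel[y1][x0] + wx * channel[y1][x1]
        feature.append((1 - wy) * top + wy * bottom)
    return feature


# Toy 2-channel, 2 x 2 feature map (made-up values).
toy_map = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[10.0, 10.0], [20.0, 20.0]],
]
centre_feature = bilinear_point_feature(toy_map, 0.5, 0.5)
```

The point (0.5, 0.5) sits between all four grid cells, so each channel of `centre_feature` is the average of that channel's four values.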

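2.1 Point Selection Strategy

During training, PointRend also needs to select points at which to build the point-wise features used to train the point head. In principle, the selection could mirror the subdivision strategy used at inference. However, subdivision introduces sequential, iterative steps that are less friendly to training neural networks with backpropagation, so training uses a non-iterative strategy based on random sampling.

Inference:

In each iteration, PointRend upsamples its previously predicted segmentation mask with bilinear interpolation and then selects the N most uncertain points on this denser grid (for a binary prediction, those with probability closest to 0.5). It then computes the point-wise feature representation for each of these N points and predicts their labels. The process repeats until the segmentation has been upsampled to the desired resolution, which makes it a coarse-to-fine procedure.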
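The selection step of one subdivision iteration can be sketched as below for the binary case. The sketch covers only the "pick the N points closest to 0.5" part and leaves out upsampling and re-prediction. The function name `most_uncertain_points` and the toy probabilities are invented for illustration.

```python
def most_uncertain_points(prob_map, num_points):
    """Return (row, col) of the cells whose foreground probability is closest to 0.5."""
    scored = []
    for row, values in enumerate(prob_map):
        for col, prob in enumerate(values):
            # A small distance to 0.5 means high uncertainty.
            scored.append((abs(prob - 0.5), row, col))
    scored.sort()
    return [(row, col) for _, row, col in scored[:num_points]]


# Toy upsampled mask: the middle column is a blurry object boundary.
upsampled = [
    [0.02, 0.45, 0.97],
    [0.01, 0.60, 0.99],
    [0.03, 0.52, 0.95],
]
selected = most_uncertain_points(upsampled, 2)
```

Only the cells in `selected` would be re-predicted by the point head. All other cells keep their interpolated values.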

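Training:

Training uses a non-iterative strategy based on random sampling, as follows:

(1) Over-generate candidates by randomly sampling kN points (k > 1) from a uniform distribution;

(2) From the kN candidates, select the βN (β ∈ [0, 1]) most uncertain points. The point-wise uncertainty measure is the distance between 0.5 and the probability of the ground-truth class, interpolated from the coarse prediction. Original text: "We use the distance between 0.5 and the probability of the ground truth class interpolated from the coarse prediction as the point-wise uncertainty measure.";

(3) Sample the remaining (1 − β)N points from a uniform distribution.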
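The three steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's code: `sample_training_points` is a made-up name, and `uncertainty_at` is a hypothetical callable that stands in for "uncertainty interpolated from the coarse prediction" and returns larger values for more uncertain points.

```python
import random


def sample_training_points(uncertainty_at, num_points, k=3, beta=0.75, rng=None):
    """Biased sampling of N points in [0, 1) x [0, 1) for training."""
    if rng is None:
        rng = random.Random()
    # (1) Over-generate k * N candidates from a uniform distribution.
    candidates = [(rng.random(), rng.random()) for _ in range(k * num_points)]
    # (2) Keep the beta * N most uncertain candidates.
    num_important = int(beta * num_points)
    candidates.sort(key=lambda point: uncertainty_at(*point), reverse=True)
    important = candidates[:num_important]
    # (3) Fill the remaining (1 - beta) * N points with fresh uniform samples.
    coverage = [(rng.random(), rng.random())
                for _ in range(num_points - num_important)]
    return important + coverage


def toy_uncertainty(x, y):
    """Hypothetical uncertainty: highest on a vertical boundary at x = 0.5."""
    return -abs(x - 0.5)


points = sample_training_points(toy_uncertainty, 8, rng=random.Random(0))
```

With k = 3 and β = 0.75, six of the eight points are drawn toward the boundary at x = 0.5 and two are spread uniformly, which keeps some coverage of confident regions.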

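2.2 Point-wise Feature Extraction

PointRend builds point-wise features at the selected points by combining low-level features (fine-grained features) with high-level features (coarse prediction).

Fine-grained features:

To let PointRend render fine segmentation details, a feature vector is extracted at each sampled point from the CNN feature maps. Fine-grained features can resolve detail, but they have two shortcomings:
(1) They carry no region-specific information. In instance segmentation, different regions may need to predict different labels for the same point, which fine-grained features alone cannot distinguish.
(2) The feature map used for fine-grained features may contain only relatively low-level information.

Coarse prediction features:

These come from the output of an existing network architecture and provide more global context to complement the fine-grained features. For instance segmentation, the coarse prediction can be, for example, the output of the lightweight 7×7 mask head in Mask R-CNN.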

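2.3 Point Classification

A multi-layer perceptron (MLP) classifies each selected point. The MLP weights are shared across all points, and the MLP can be trained with standard task-specific segmentation losses.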
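The following plain-Python sketch shows what "shared weights across all points" means in practice: the same small MLP is applied to each concatenated (coarse + fine) feature vector independently, which is also why it behaves like a 1×1 conv. The layer sizes, weights, and the names `linear` and `point_head` are made up for illustration.

```python
def linear(weights, bias, vector):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def point_head(point_features, layers):
    """Apply the same MLP (shared weights) to every point independently."""
    predictions = []
    for feature in point_features:
        hidden = feature
        for depth, (weights, bias) in enumerate(layers):
            hidden = linear(weights, bias, hidden)
            if depth < len(layers) - 1:  # ReLU on hidden layers only
                hidden = [max(0.0, h) for h in hidden]
        predictions.append(hidden)
    return predictions


# Toy data: 2 points, each with 1 coarse value and 1 fine value.
coarse = [[2.0], [1.0]]
fine = [[1.0], [3.0]]
features = [c + f for c, f in zip(coarse, fine)]  # concatenate per point

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer: 2 -> 2
    ([[1.0, 1.0]], [0.5]),                    # output layer: 2 -> 1
]
logits = point_head(features, layers)
```

Because no point looks at any other point, the prediction for a point does not change if the other points are removed from the batch.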

3 Source Code Walkthrough

3.1 Points Selection

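import torch

def sampling_points(mask, N, k=3, beta=0.75, training=True):
    """
    Main idea: use the coarse prediction to find the uncertain pixels.
    :param mask: coarse prediction (out), e.g. [2, 19, 48, 48]
    :param N: number of uncertain points (train: N = image size / 16, test: N = 8096), e.g. N = 48
    :param k: hyperparameter, over-generation multiplier
    :param beta: hyperparameter, fraction of points chosen by uncertainty
    :param training: True during training, False during inference
    :return: coordinates of the uncertain points, e.g. [2, 48, 2]
             (at inference the flat indices are returned as well)
    """
    assert mask.dim() == 4, "Dim must be N(Batch)CHW"   # this mask is out (coarse)
    device = mask.device
    B, _, H, W = mask.shape   # first: mask [1, 19, 48, 48]
    # Sort the class scores of every pixel in descending order along the channel dim
    mask, _ = mask.sort(1, descending=True)  # mask: [1, 19, 48, 48]

    if not training:
        H_step, W_step = 1 / H, 1 / W
        N = min(H * W, N)
        # mask[:, 0] is the highest class score of each pixel and mask[:, 1] the second highest.
        # When the two are close, the pixel is hard to predict and its uncertainty is large.
        uncertainty_map = -1 * (mask[:, 0] - mask[:, 1])
        _, idx = uncertainty_map.view(B, -1).topk(N, dim=1)  # flat indices of the N hardest points

        points = torch.zeros(B, N, 2, dtype=torch.float, device=device)
        points[:, :, 0] = W_step / 2.0 + (idx  % W).to(torch.float) * W_step    # x coordinate
        points[:, :, 1] = H_step / 2.0 + (idx // W).to(torch.float) * H_step    # y coordinate
        return idx, points  # idx: [1, N] || points: [1, N, 2]

    # The post does not show the training branch. The lines below are a sketch of the
    # strategy described in Section 2.1 (an assumption, not necessarily the author's code).
    # It uses point_sample from Section 3.2 and the same top-1 vs. top-2 uncertainty as above.
    num_important = int(beta * N)
    over_generation = torch.rand(B, k * N, 2, device=device)    # (1) kN uniform candidates
    sampled = point_sample(mask, over_generation, align_corners=False)  # [B, C, kN]
    uncertainty = -1 * (sampled[:, 0] - sampled[:, 1])           # [B, kN]
    _, idx = uncertainty.topk(num_important, dim=1)              # (2) beta*N most uncertain
    importance = torch.gather(over_generation, 1, idx.unsqueeze(-1).expand(-1, -1, 2))
    coverage = torch.rand(B, N - num_important, 2, device=device)  # (3) remaining uniform points
    return torch.cat([importance, coverage], dim=1)              # [B, N, 2]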

3.2 Point-wise Representation and Point Head

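The relative coordinates of the selected uncertain points within the image are used to look up the corresponding feature points. Each point's feature vector is concatenated with the coarse prediction at that point, and an MLP then produces the refined prediction. The code is as follows: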

## Training
# Method of the point head module (the class definition is not shown in the post)
def forward(self, x, res2, out):
    """
    Main idea:
    Use out (the coarse prediction) to find the top N uncertain pixels. For each of them, fetch the
    corresponding features from res2 (fine) and out (coarse), concatenate the fine and coarse features
    of the N points, and pass them through the MLP to obtain a more accurate prediction (rend).
    :param x: input image tensor, e.g. [2, 3, 768, 768]
    :param res2: first-stage feature output of Xception, e.g. [2, 256, 192, 192] (4x downsampled)
    :param out: coarse prediction from the features extracted by cascaded atrous convolutions, e.g. [2, 19, 48, 48] (16x downsampled)
    :return: rend: the more accurate prediction, points: locations of the uncertain pixels
    """
    """
    1. Fine-grained features are interpolated from res2 for DeeplabV3
    2. During training we sample as many points as there are on a stride 16 feature map of the input
    3. To measure prediction uncertainty
       we use the same strategy during training and inference: the difference between the most
       confident and second most confident class probabilities.
    """
    if not self.training:
        return self.inference(x, res2, out)
    # Coordinates of the uncertain points
    points = sampling_points(out, x.shape[-1] // 16, self.k, self.beta)  # out: [2, 19, 48, 48] || x: [2, 3, 768, 768] || points: [2, 48, 2]
    # Coarse features at the uncertain points
    coarse = point_sample(out, points, align_corners=False)  # [2, 19, 48]
    # Fine features at the uncertain points
    fine = point_sample(res2, points, align_corners=False)   # [2, 256, 48]
    # Concatenate the two feature vectors
    feature_representation = torch.cat([coarse, fine], dim=1)  # [2, 275, 48]
    # Refine the prediction with the MLP
    rend = self.mlp(feature_representation)  # [2, 19, 48]

    return {"rend": rend, "points": points}

## Inference
# Method of the point head module (the class definition is not shown in the post)
@torch.no_grad()
def inference(self, x, res2, out):
    """
    Inputs:
    x: [1, 3, 768, 768], the input image tensor
    res2: [1, 256, 192, 192], first-stage feature output of Xception (4x downsampled)
    out: [1, 19, 48, 48], coarse prediction from the features extracted by cascaded atrous convolutions (16x downsampled)
    Output:
    out: [1, 19, 768, 768], the final prediction for the image
    Main idea:
    Use out to find the top N = 8096 uncertain pixels. For each of them, fetch the corresponding
    features from res2 (fine) and out (coarse), concatenate the fine and coarse features of the
    8096 points to get the input of the MLP, and predict again. Iterate until the prediction
    has the same size as the input image.
    """
    """
    During inference, subdivision uses N=8096
    (i.e., the number of points in the stride 16 map of a 1024×2048 image)
    """
    num_points = 8096

    while out.shape[-1] != x.shape[-1]:  # out: [1, 19, 48, 48], x: [1, 3, 768, 768]
        # Each iteration doubles the resolution until it matches the input image
        out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=True)  # first: [1, 19, 96, 96]

        points_idx, points = sampling_points(out, num_points, training=self.training)  # points_idx: [1, 8096] || points: [1, 8096, 2]

        coarse = point_sample(out, points, align_corners=False)  # [1, 19, 8096], coarse class scores at the 8096 uncertain points
        fine = point_sample(res2, points, align_corners=False)   # [1, 256, 8096], low-level features at the 8096 uncertain points

        feature_representation = torch.cat([coarse, fine], dim=1)  # [1, 275, 8096], fine and coarse features combined

        rend = self.mlp(feature_representation)  # [1, 19, 8096]

        B, C, H, W = out.shape  # first: [1, 19, 96, 96]
        points_idx = points_idx.unsqueeze(1).expand(-1, C, -1)  # [1, 19, 8096]
        # Write the refined predictions back at the flat indices of the selected points
        out = (out.reshape(B, C, -1)
                  .scatter_(2, points_idx, rend)    # first: [1, 19, 9216]
                  .view(B, C, H, W))                # first: [1, 19, 96, 96]

    return {"fine": out}
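The write-back at the end of each iteration (reshape, `scatter_`, view) can be hard to picture, so here is a plain-Python sketch of the same idea for a single channel. The function name `scatter_refined` and the toy values are made up. The only thing taken from the code above is the flat-index convention `row * W + col`.

```python
def scatter_refined(coarse_map, flat_indices, refined_values):
    """Overwrite selected cells of an H x W map, addressed by flat index row * W + col."""
    height, width = len(coarse_map), len(coarse_map[0])
    flat = [value for row in coarse_map for value in row]  # flatten, like reshape(B, C, -1)
    for index, value in zip(flat_indices, refined_values):
        flat[index] = value                                # like scatter_ along the last dim
    # Restore the 2-D layout, like view(B, C, H, W).
    return [flat[r * width:(r + 1) * width] for r in range(height)]


upsampled = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]
# Flat index 1 is (row 0, col 1) and flat index 5 is (row 1, col 2).
refined = scatter_refined(upsampled, [1, 5], [9.0, 8.0])
```

Only the two selected cells change. Every other cell keeps the value it got from bilinear upsampling.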



import torch.nn.functional as F
def point_sample(input, point_coords, **kwargs):
    """
    A wrapper around :function:`torch.nn.functional.grid_sample` to support 3D point_coords tensors.
    Unlike :function:`torch.nn.functional.grid_sample` it assumes `point_coords` to lie inside
    [0, 1] x [0, 1] square.
    Args:
        input (Tensor): A tensor of shape (N, C, H, W) that contains features map on a H x W grid.
        point_coords (Tensor): A tensor of shape (N, P, 2) or (N, Hgrid, Wgrid, 2) that contains
        [0, 1] x [0, 1] normalized point coordinates.
    Returns:
        output (Tensor): A tensor of shape (N, C, P) or (N, C, Hgrid, Wgrid) that contains
            features for points in `point_coords`. The features are obtained via bilinear
            interpolation from `input` the same way as :function:`torch.nn.functional.grid_sample`.
    """
    add_dim = False
    if point_coords.dim() == 3:
        add_dim = True
        point_coords = point_coords.unsqueeze(2)
    output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)
    if add_dim:
        output = output.squeeze(3)
    return output

For details on torch.nn.functional.grid_sample, see its documentation.
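Two coordinate conventions meet in the code above: `sampling_points` turns a flat index into a pixel-centre coordinate in [0, 1], and `point_sample` maps [0, 1] to the [-1, 1] range that `grid_sample` works with. The sketch below restates both conversions in plain Python. The helper names `index_to_point` and `to_grid_sample_coords` are invented for this example.

```python
def index_to_point(flat_index, height, width):
    """Flat index -> normalized (x, y) at the pixel centre, both in [0, 1].

    Same arithmetic as W_step / 2 + col * W_step in sampling_points.
    """
    row, col = divmod(flat_index, width)
    return ((col + 0.5) / width, (row + 0.5) / height)


def to_grid_sample_coords(point):
    """Map [0, 1] coordinates to [-1, 1], i.e. 2.0 * point_coords - 1.0."""
    x, y = point
    return (2.0 * x - 1.0, 2.0 * y - 1.0)


# Flat index 5 on a 2 x 4 grid is row 1, col 1.
point = index_to_point(5, height=2, width=4)
grid_point = to_grid_sample_coords(point)
```

Note that x comes first and y second, which matches the order `grid_sample` expects in the last dimension of its grid.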

3.3 Loss Function

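Because there is both a whole-image prediction and a refined prediction at the selected points, the loss is the sum of these two parts. The code is as follows: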

import torch.nn as nn
import torch.nn.functional as F

class PointRendLoss(nn.CrossEntropyLoss):
    def __init__(self, aux=True, aux_weight=0.2, ignore_index=-1, **kwargs):
        super(PointRendLoss, self).__init__(ignore_index=ignore_index)
        self.aux = aux
        self.aux_weight = aux_weight
        self.ignore_index = ignore_index

    def forward(self, *inputs, **kwargs):
        result, gt = tuple(inputs)
        # result['res2']: [2, 256, 192, 192], features from Xception's c1 stage
        # result['coarse']: [2, 19, 48, 48]
        # result['rend']: [2, 19, 48]
        # result['points']: [2, 48, 2]
        # gt: [2, 768, 768], the label map of the image

        # pred: [2, 19, 768, 768], the coarse prediction interpolated to the label size
        pred = F.interpolate(result["coarse"], gt.shape[-2:], mode="bilinear", align_corners=True)

        # Cross-entropy loss over all pixels
        seg_loss = F.cross_entropy(pred, gt, ignore_index=self.ignore_index)

        # Ground-truth labels at the uncertain points (nearest-neighbor sampling keeps labels discrete)
        gt_points = point_sample(
            gt.float().unsqueeze(1),
            result["points"],
            mode="nearest",
            align_corners=False
        ).squeeze_(1).long()

        # Cross-entropy loss over the uncertain points
        points_loss = F.cross_entropy(result["rend"], gt_points, ignore_index=self.ignore_index)

        # Whole image + uncertain points
        loss = seg_loss + points_loss

        return dict(loss=loss)
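To make the "sum of two cross-entropy terms" concrete, here is a small plain-Python sketch. It is not the class above: the names `cross_entropy`, `mean_cross_entropy`, and `pointrend_loss` are made up, inputs are simple lists of per-location logits, and returning 0.0 when every label is ignored is a choice of this sketch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one location: log-sum-exp of the logits minus the target logit."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def mean_cross_entropy(all_logits, targets, ignore_index=-1):
    """Mean cross-entropy over locations whose label is not ignore_index."""
    losses = [cross_entropy(logits, target)
              for logits, target in zip(all_logits, targets)
              if target != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0


def pointrend_loss(dense_logits, dense_targets, point_logits, point_targets):
    """Whole-image term plus the term over the sampled points."""
    seg_loss = mean_cross_entropy(dense_logits, dense_targets)
    points_loss = mean_cross_entropy(point_logits, point_targets)
    return seg_loss + points_loss


# Toy example with 2 classes: two dense locations (one ignored) and one sampled point.
total = pointrend_loss(
    dense_logits=[[0.0, 0.0], [5.0, 0.0]],
    dense_targets=[0, -1],
    point_logits=[[0.0, 0.0]],
    point_targets=[1],
)
```

In this toy case each term equals ln 2, because the only non-ignored logits are uniform over two classes.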