Implementing MultiScaleRoIAlign and MultiScaleRoIPooling in PyTorch

m0_57300975

Last modified 2024-07-21 21:27:34


Tags: pytorch, artificial intelligence, python

First published 2024-07-21 19:55:43

Article link: https://blog.csdn.net/m0_57300975/article/details/140574528


Contents

Introduction to the Principles of RoI Pooling and RoI Align
- ROI Pooling
- ROI Align
Code Walkthrough
Results
References

Introduction to the Principles of RoI Pooling and RoI Align

ROI Pooling

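RoI Pooling converts the features of a region of interest (RoI) of arbitrary size into a small feature map of fixed size H×W. The RoI is first projected onto the feature map, giving an h×w region. This region is divided into an H×W grid of sub-windows, each approximately (h/H)×(w/W) in size, and the maximum value in each sub-window is taken as the value of the corresponding output cell. Because both the projected RoI and the sub-window boundaries are rounded to integer positions on the feature map, the pooled features can be slightly misaligned with the original RoI.
Step 1: the input is a feature map of fixed size 8×8.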
Step 2: after projection onto the feature map, the RoI has the corner coordinates (top-left, bottom-right) (0, 3) and (7, 8).

Step 3: divide the RoI into a 2×2 grid.

Step 4: apply max pooling within each grid cell.
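The four steps above can be sketched in plain Python on a single-channel map. This is a simplified illustration, not the torchvision kernel: `roi_max_pool` is a hypothetical helper, the corners (0, 3) and (7, 8) are read as (x, y) with the bottom-right corner exclusive, and bin edges are split with integer floor division so that neighbouring cells never overlap (torchvision's `roi_pool` may round bin edges differently).

```python
def roi_max_pool(feature, x1, y1, x2, y2, out_h, out_w):
    """Max-pool the region rows [y1, y2) x cols [x1, x2) of a 2-D map
    (list of lists) into an out_h x out_w grid.

    The RoI corners are assumed to be already quantized to integers.
    """
    roi_h, roi_w = y2 - y1, x2 - x1
    output = []
    for i in range(out_h):
        # Quantized row range of bin i: floor(i*h/H) .. floor((i+1)*h/H)
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + ((i + 1) * roi_h) // out_h
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + ((j + 1) * roi_w) // out_w
            # An empty bin (RoI smaller than the output grid) yields 0
            row.append(max((feature[r][c] for r in range(r0, r1) for c in range(c0, c1)),
                           default=0))
        output.append(row)
    return output


# Toy 8x8 feature map whose value at (row, col) is row * 8 + col
fmap = [[r * 8 + c for c in range(8)] for r in range(8)]
# RoI with corners (x, y) = (0, 3) and (7, 8): 7 columns x 5 rows
print(roi_max_pool(fmap, 0, 3, 7, 8, 2, 2))  # [[34, 38], [58, 62]]
```

Because the 7×5 region cannot be split evenly, the four cells cover 3×2, 4×2, 3×3 and 4×3 values; with this toy map the maximum of each cell is simply its bottom-right element.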


ROI Align

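RoI Align likewise converts the features of an RoI of arbitrary size into a small feature map of fixed size H×W. The RoI is projected onto the feature map, giving an h×w region, which is divided into an H×W grid of bins, each approximately (h/H)×(w/W) in size. Unlike RoI Pooling, neither the RoI coordinates nor the bin boundaries are rounded: the values at regularly spaced sampling points inside each bin are computed by bilinear interpolation and then averaged to give the output value of that bin.

Step 1: the input is a feature map of fixed size 8×8.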
Step 2: after projection onto the feature map, the RoI has the corner coordinates (top-left, bottom-right) (0, 3) and (7, 8).

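Step 3: take four sampling points in each grid cell; the red crosses (x) in the figure mark their positions.

Step 4: compute the value at each of the four red crosses in every cell by bilinear interpolation. (For the detailed bilinear interpolation calculation, see the article "Explaining the role of RoIAlign in Mask R-CNN / the bilinear interpolation method".)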
Step 5: average the four values in each cell to obtain the RoI Align output.
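Steps 3 to 5 can be sketched in the same plain-Python style. `roi_align_2d` and `bilinear` are hypothetical helpers: the sketch follows the usual scheme (as in torchvision's `roi_align` with `aligned=False` and a positive `sampling_ratio`) of placing `sampling_ratio` × `sampling_ratio` evenly spaced points in each bin and averaging their bilinearly interpolated values, but it simplifies boundary handling by clamping points into the map. The fractional RoI used below is made up on purpose, to show that nothing is rounded.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D map (list of lists) at real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Simplified boundary handling: clamp the point into the map
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y_lo, x_lo = int(y), int(x)
    y_hi, x_hi = min(y_lo + 1, h - 1), min(x_lo + 1, w - 1)
    ly, lx = y - y_lo, x - x_lo          # distances to the lower neighbours
    hy, hx = 1.0 - ly, 1.0 - lx
    return (hy * hx * feature[y_lo][x_lo] + hy * lx * feature[y_lo][x_hi]
            + ly * hx * feature[y_hi][x_lo] + ly * lx * feature[y_hi][x_hi])


def roi_align_2d(feature, x1, y1, x2, y2, out_h, out_w, sampling_ratio=2):
    """RoI Align on one channel: the RoI and bin edges stay fractional.
    sampling_ratio=2 gives the four sampling points per bin of step 3."""
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(sampling_ratio):
                for sx in range(sampling_ratio):
                    # Points sit at the centres of a regular sub-grid inside the bin
                    y = y1 + i * bin_h + (sy + 0.5) * bin_h / sampling_ratio
                    x = x1 + j * bin_w + (sx + 0.5) * bin_w / sampling_ratio
                    total += bilinear(feature, y, x)
            row.append(total / sampling_ratio ** 2)  # step 5: average the samples
        output.append(row)
    return output


fmap = [[r * 8 + c for c in range(8)] for r in range(8)]
print(roi_align_2d(fmap, 0.5, 2.5, 6.5, 6.5, 2, 2))  # [[30.0, 33.0], [46.0, 49.0]]
```

Since the toy map is linear in (row, col) and bilinear interpolation reproduces linear functions exactly, each output equals the map value at its bin centre, e.g. 8 × 3.5 + 2 = 30 for the top-left bin.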

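Code Walkthrough

The main body of the code is the Faster R-CNN implementation by Bilibili uploader 霹雳吧啦Wz. The complete code used in this article is open-sourced at the following link (Faster R-CNN). The code added for this article lives mainly in network_files/roi_function.py.

Usage

To switch between RoI Pooling and RoI Align, modify the following code in network_files/faster_rcnn_framework.py: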

class FasterRCNN(FasterRCNNBase):
    ....
        # Multi-scale RoIAlign pooling
        if box_roi_pool is None:
            box_roi_pool = MultiScaleRoIAlign(
                featmap_names=['0', '1', '2', '3'],  # feature maps on which RoI Align is performed
                output_size=[7, 7],
                sampling_ratio=2)
            # To use RoI Pooling instead, replace the block above with:
            # box_roi_pool = MultiScaleRoIPooling(
            #     featmap_names=['0', '1', '2', '3'],  # feature maps on which RoI pooling is performed
            #     output_size=[7, 7]
            # )

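MultiScaleRoIOperation Code Walkthrough

network_files/roi_function.py contains three classes: the parent class MultiScaleRoIOperation and its two subclasses MultiScaleRoIPooling and MultiScaleRoIAlign. The two subclasses differ only in the operation used to pool each RoI (roi_pool versus roi_align); every other processing step is identical, so this article focuses on the implementation of MultiScaleRoIOperation.

Initialization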

class MultiScaleRoIOperation(nn.Module):
    def __init__(self, featmap_names, output_size, sampling_ratio=None, method='align'):
        super(MultiScaleRoIOperation, self).__init__()
        assert all([type(featmap_name) == type('') for featmap_name in featmap_names]), \
            'featmap_name must be a str type.'
        self.method = method
        if method == 'align':
            assert type(sampling_ratio) in [int, float], 'sampling_ratio must be int or float when method is align'
            self.sampling_ratio = sampling_ratio
        self.featmap_names = featmap_names
        self.output_size = output_size

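The input arguments are:

featmap_names=['0', '1', '2', '3']  # feature maps on which RoI Align is performed
# '0' corresponds to P2 in the FPN figure, '1' to P3, '2' to P4, '3' to P5, and 'pool' to P6
# (figure taken from the FPN structure explanation by Bilibili uploader 霹雳吧啦Wz: https://www.bilibili.com/video/BV1dh411U7D9)
output_size=[7, 7]  # size of the output grid, 7x7
sampling_ratio=2  # RoI Align only: sampling points per bin along each axis (2 -> 2x2 = 4 points per bin)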
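A small sketch of what these arguments imply. `FEATMAP_TO_LEVEL`, `level_info` and `points_per_roi` are hypothetical names; the level mapping is copied from the comments above, the stride of P_l is taken to be 2^l, and the stride of 'pool' assumes it is a stride-2 max pool on P5 (as torchvision's `LastLevelMaxPool` does).

```python
# Hypothetical lookup mirroring the comments above: featmap key -> FPN level P_l
FEATMAP_TO_LEVEL = {'0': 2, '1': 3, '2': 4, '3': 5, 'pool': 6}


def level_info(name):
    """Return (level, stride, spatial_scale) for an FPN feature-map key."""
    level = FEATMAP_TO_LEVEL[name]
    stride = 2 ** level  # P_l is downsampled 2**l times relative to the input image
    return level, stride, 1.0 / stride


def points_per_roi(output_size=(7, 7), sampling_ratio=2):
    """Number of bilinear samples per RoI when every output bin uses a
    sampling_ratio x sampling_ratio grid of sampling points."""
    return output_size[0] * output_size[1] * sampling_ratio ** 2


for name in ['0', '1', '2', '3']:
    print(name, level_info(name))
print(points_per_roi())  # 196
```

The `spatial_scale` returned here is the value passed to roi_pool/roi_align for that level, which is what `_setup_scales` recovers from the tensor shapes.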

Forward pass

# Requires: import torch; from torchvision.ops import roi_align, roi_pool
def forward(self, features, rois, image_shape):
    """
    Arguments:
        features (Dict[str, Tensor]): FPN feature maps
        rois (List[Tensor[N, 4]]): proposal boxes (x1, y1, x2, y2) of each image
        image_shape (List[Tuple[H, W]]): image shape of each image in the batch
    Returns:
        Tensor:
            Pooled features
    """
    filtered_features = self._filter_inputs(features, self.featmap_names)
    rois = self._convert_to_roi_format(rois)

    scales = self._setup_scales(filtered_features, image_shape)

    # Pyramid levels of the selected feature maps (2, 3, 4, 5)
    lvl_min = -torch.log2(torch.tensor(scales[0], dtype=torch.float32)).item()
    lvl_max = -torch.log2(torch.tensor(scales[-1], dtype=torch.float32)).item()
    # Assign each RoI to an FPN level with the formula from the FPN paper
    # (https://arxiv.org/pdf/1612.03144v2): k = floor(k0 + log2(sqrt(w*h) / 224))
    k0 = 4
    box_area = (rois[:, 4] - rois[:, 2]) * (rois[:, 3] - rois[:, 1])
    k = torch.floor(k0 + torch.log2(torch.sqrt(box_area) / 224))
    # Clamp to the available levels and shift to a 0-based index into filtered_features
    k = torch.clamp(k, lvl_min, lvl_max) - lvl_min
    # Output buffer, e.g. of shape [1024, 256, 7, 7]
    all_level_pooled_feature = torch.zeros([len(rois), filtered_features[0].shape[1], *self.output_size],
                                           dtype=filtered_features[0].dtype, device=rois.device)
    # Pool the RoIs of each level with torchvision.ops.roi_pool / roi_align
    for idx, (featmap_in_level, scale) in enumerate(zip(filtered_features, scales)):
        mask_in_level = (k == idx)
        rois_in_level = rois[mask_in_level]
        if self.method == 'pooling':
            pooled_feature = roi_pool(featmap_in_level, rois_in_level, output_size=self.output_size,
                                      spatial_scale=scale)
        elif self.method == 'align':
            pooled_feature = roi_align(featmap_in_level, rois_in_level, output_size=self.output_size,
                                       spatial_scale=scale, sampling_ratio=self.sampling_ratio)
        # Scatter the results back to the original RoI order
        all_level_pooled_feature[mask_in_level] = pooled_feature

    return all_level_pooled_feature
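To see which pyramid level each box is routed to, the level-assignment step of forward() can be reproduced in plain Python for single boxes. This is an illustrative sketch: `assign_fpn_level` is a hypothetical helper that hard-codes levels 2 to 5 (featmap_names=['0', '1', '2', '3']). torchvision's own `LevelMapper` uses the same formula but also adds a small epsilon before flooring.

```python
import math


def assign_fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Return the 0-based index into filtered_features for a w x h box (image pixels).

    Same rule as forward(): k = floor(k0 + log2(sqrt(w * h) / canonical)),
    clamped to [k_min, k_max] and shifted so that k_min maps to index 0.
    """
    area = w * h
    if area <= 0:
        # torch.log2(0) gives -inf, which torch.clamp maps to the finest level
        return 0
    k = math.floor(k0 + math.log2(math.sqrt(area) / canonical))
    k = min(max(k, k_min), k_max)
    return k - k_min  # 0 -> '0' (P2), 1 -> '1' (P3), 2 -> '2' (P4), 3 -> '3' (P5)


boxes = [(32, 32), (112, 112), (224, 224), (448, 448), (1000, 1000)]
levels = [assign_fpn_level(w, h) for w, h in boxes]
print(levels)  # [0, 1, 2, 3, 3]

# Group box indices by level, which is what mask_in_level = (k == idx) selects per level
groups = {}
for i, lvl in enumerate(levels):
    groups.setdefault(lvl, []).append(i)
print(groups)  # {0: [0], 1: [1], 2: [2], 3: [3, 4]}
```

A 224×224 box lands on P4 (k0 = 4), halving the side length moves one level down, and boxes that are too small or too large are clamped to P2 or P5.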

Typical inputs (taking batch_size=2 as an example):

features={
	'0': Tensor_shape:(2, 256, 200, 304)
	'1': Tensor_shape:(2, 256, 100, 152)
	'2': Tensor_shape:(2, 256, 50, 76)
	'3': Tensor_shape:(2, 256, 25, 38)
	'pool': Tensor_shape:(2, 256, 13, 19)
}
image_shapes=[(800, 1201), (800, 1066)]
rois=[Tensor_shape:(512, 4), Tensor_shape:(512, 4)]
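The feature-map sizes above follow from the image sizes. The hypothetical helper below reproduces them under stated assumptions: the batch is zero-padded up to a multiple of 32 (as torchvision's `GeneralizedRCNNTransform` does by default), level P_l has stride 2^l, and 'pool' is a kernel-1, stride-2 max pool on P5, which rounds odd sizes up.

```python
import math


def fpn_feature_shapes(image_shapes, size_divisible=32, levels=(2, 3, 4, 5)):
    """Predict the spatial size of each FPN level for a padded batch (assumptions above)."""
    max_h = max(h for h, _ in image_shapes)
    max_w = max(w for _, w in image_shapes)
    # Pad the batch to a multiple of size_divisible, e.g. 1201 -> 1216
    pad_h = math.ceil(max_h / size_divisible) * size_divisible
    pad_w = math.ceil(max_w / size_divisible) * size_divisible
    shapes = {str(i): (pad_h // 2 ** lvl, pad_w // 2 ** lvl) for i, lvl in enumerate(levels)}
    # 'pool': max pool with kernel 1, stride 2 on the last level -> ceil(n / 2)
    p5_h, p5_w = shapes[str(len(levels) - 1)]
    shapes['pool'] = ((p5_h + 1) // 2, (p5_w + 1) // 2)
    return shapes


print(fpn_feature_shapes([(800, 1201), (800, 1066)]))
# {'0': (200, 304), '1': (100, 152), '2': (50, 76), '3': (25, 38), 'pool': (13, 19)}
```

This also shows why `_setup_scales` divides by the largest image size in the batch: all feature maps are computed on the padded batch, not on each individual image.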

Helper functions

@staticmethod
def _filter_inputs(x, featmap_names):
    """
    Keep the entries of the OrderedDict x whose keys are in featmap_names; returns List[Tensor].
    """
    x_filtered = []
    for k, v in x.items():
        if k in featmap_names:
            x_filtered.append(v)
    return x_filtered

@staticmethod
def _convert_to_roi_format(rois):
    """
    Convert rois from [Tensor_shape:(512, 4), Tensor_shape:(512, 4)] to a single Tensor_shape:[1024, 5],
    where rois[:, 0] holds the batch index of each box: [0, 0, 0, ..., 1, 1, 1].
    """
    rois = [
        # dtype=roi.dtype keeps the batch-index column in the same dtype as the boxes
        torch.cat([torch.full((roi.shape[0], 1), batch_idx, dtype=roi.dtype, device=roi.device), roi], dim=1)
        for batch_idx, roi in enumerate(rois)]
    rois = torch.cat(rois, dim=0)
    return rois

@staticmethod
def _get_scales(feature, image_size):
    """
    Compute the ratio of the feature map's size to image_size, rounded to a power of two, e.g. 2^(-2).
    """
    size = feature.shape[-2:]
    scales = []
    for s1, s2 in zip(size, image_size):
        scale = float(s1) / float(s2)
        scale = 2 ** (float(torch.tensor(scale).log2().round()))
        scales.append(scale)
    # Only the height ratio is returned
    return scales[0]

def _setup_scales(self, features, image_shapes):
    """
    Compute the scale of each feature map relative to the largest image size in the batch,
    e.g. 2^(-2), 2^(-3), 2^(-4), 2^(-5).
    """
    max_w = 0
    max_h = 0
    for shape in image_shapes:
        max_h = max(shape[0], max_h)
        max_w = max(shape[1], max_w)
    input_size = (max_h, max_w)

    scales = [self._get_scales(feat, input_size) for feat in features]
    return scales
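The two tensor helpers are easy to check by hand. The sketch below re-implements `_convert_to_roi_format` and the height-based scale computation of `_get_scales`/`_setup_scales` on plain Python lists (the lowercase function names are hypothetical) and applies them to the typical shapes listed earlier.

```python
import math


def convert_to_roi_format(rois_per_image):
    """List version of _convert_to_roi_format: prepend each box's batch index."""
    return [[float(batch_idx), *box]
            for batch_idx, boxes in enumerate(rois_per_image)
            for box in boxes]


def setup_scales(feature_shapes, image_shapes):
    """List version of _setup_scales/_get_scales: ratio of each feature map's height
    to the largest image height in the batch, rounded to the nearest power of two."""
    max_h = max(h for h, _ in image_shapes)
    return [2.0 ** round(math.log2(fh / max_h)) for fh, _ in feature_shapes]


rois = convert_to_roi_format([[[1, 2, 3, 4]], [[5, 6, 7, 8], [9, 10, 11, 12]]])
print(rois)  # [[0.0, 1, 2, 3, 4], [1.0, 5, 6, 7, 8], [1.0, 9, 10, 11, 12]]

scales = setup_scales([(200, 304), (100, 152), (50, 76), (25, 38)], [(800, 1201), (800, 1066)])
print(scales)  # [0.25, 0.125, 0.0625, 0.03125]

# forward() turns the first and last scale into the level range 2..5
lvl_min, lvl_max = -math.log2(scales[0]), -math.log2(scales[-1])
print(lvl_min, lvl_max)  # 2.0 5.0
```

The leading batch-index column is what lets roi_pool/roi_align take a single [K, 5] tensor for the whole batch instead of a list of per-image boxes.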

MultiScaleRoIPooling Code Walkthrough

It sets method='pooling' and calls the forward method of the parent class MultiScaleRoIOperation.

class MultiScaleRoIPooling(MultiScaleRoIOperation):
    def __init__(self, featmap_names, output_size):
        super(MultiScaleRoIPooling, self).__init__(featmap_names, output_size, method='pooling')

    def forward(self, features, rois, image_shape):
        """
        Arguments:
            features (Dict[str, Tensor]): FPN feature maps
            rois (List[Tensor[N, 4]]): proposal boxes
            image_shape (List[Tuple[H, W]]): image shape of each image in the batch
        Returns:
            Tensor:
                Pooled features
        """
        return super().forward(features, rois, image_shape)

MultiScaleRoIAlign Code Walkthrough

It sets method='align' and calls the forward method of the parent class MultiScaleRoIOperation.

class MultiScaleRoIAlign(MultiScaleRoIOperation):
    def __init__(self, featmap_names, output_size, sampling_ratio):
        super(MultiScaleRoIAlign, self).__init__(featmap_names, output_size,
                                                 sampling_ratio=sampling_ratio, method='align')

    def forward(self, features, rois, image_shape):
        """
        Arguments:
            features (Dict[str, Tensor]): FPN feature maps
            rois (List[Tensor[N, 4]]): proposal boxes
            image_shape (List[Tuple[H, W]]): image shape of each image in the batch
        Returns:
            Tensor:
                aligned features
        """
        return super().forward(features, rois, image_shape)

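Results

Finally, the experiments were run on a machine with four NVIDIA GeForce RTX 3080 GPUs. Starting from the fasterrcnn_resnet50_fpn_coco.pth weights, the model was fine-tuned on the Pascal VOC 2007+2012 train+val sets and evaluated on the Pascal VOC 2007 test set, with fine-tuning run up to epoch 10. The training arguments were as follows: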

Namespace(data_path='../../PASCAL_VOC_2012/data/', device='cuda', num_classes=20, batch_size=8, start_epoch=0, epochs=20, workers=8, lr=0.02, momentum=0.9, weight_decay=0.0001, lr_step_size=8, lr_steps=[7, 12], lr_gamma=0.1, print_freq=20, output_dir='./multi_train', resume='', aspect_ratio_group_factor=3, test_only=False, world_size=4, dist_url='env://', sync_bn=True, amp='True', rank=0, gpu=0, distributed=True, dist_backend='nccl')

The COCO-style metrics on the validation set (the Pascal VOC 2007 test set) are as follows:
ROI Pooling

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.569
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.867
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.648
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.304
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.453
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.613
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.663
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.670
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.456
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.574
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.705

ROI Align

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.582
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.870
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.651
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.333
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.465
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.625
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.479
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.678
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.686
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.478
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.589
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.719
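The two logs can be compared directly. The short script below only re-reads the headline AP values from the logs above and computes RoI Align minus RoI Pooling; it adds no new measurement. On this single run RoI Align is higher on every AP metric, with the largest gain on small objects (+0.029), which fits the reduced quantization error described earlier, although one run per setting is not enough to judge how robust the gap is.

```python
# Headline AP values copied from the two evaluation logs above
roi_pool_ap = {"AP@[.50:.95]": 0.569, "AP@.50": 0.867, "AP@.75": 0.648,
               "AP_small": 0.304, "AP_medium": 0.453, "AP_large": 0.613}
roi_align_ap = {"AP@[.50:.95]": 0.582, "AP@.50": 0.870, "AP@.75": 0.651,
                "AP_small": 0.333, "AP_medium": 0.465, "AP_large": 0.625}

# Absolute difference (RoI Align minus RoI Pooling), rounded to the logs' 3 decimals
deltas = {name: round(roi_align_ap[name] - roi_pool_ap[name], 3) for name in roi_pool_ap}
for name, delta in deltas.items():
    print(f"{name:>12}: {delta:+.3f}")
```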