ROI Pooling and ROI Align: How They Work
ROI Pooling
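RoI Pooling converts the features of a region of interest (RoI) of any size into a small feature map with a fixed size H×W. The RoI is first projected onto the feature map, giving an h×w window. That window is divided into an H×W grid of sub-windows, each of approximate size h/H × w/W, and the values in each sub-window are max-pooled into the corresponding output cell.
Step 1: take a fixed-size 8x8 feature map as input.
Step 2: project the RoI onto the feature map; its (top-left, bottom-right) coordinates are (0,3) and (7,8).
Step 3: divide the RoI into a 2x2 grid.
Step 4: apply max pooling to each grid cell.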
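The four steps above can be sketched in pure Python. This is only an illustration under stated assumptions, not the torchvision kernel: the helper name `roi_max_pool` is hypothetical, the coordinates are read as (x, y) with an exclusive bottom-right corner, and the cell boundaries use the floor/ceil rule, so neighbouring cells may share a row or column when h or w is not divisible by the grid size.

```python
import math

def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    roi is (x1, y1, x2, y2) with an exclusive bottom-right corner (assumption).
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of grid cell i: floor for the start, ceil for the end.
        r0 = y1 + math.floor(i * h / out_h)
        r1 = y1 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * w / out_w)
            c1 = x1 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# 8x8 feature map whose value at (row, col) is row * 8 + col.
feature = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled = roi_max_pool(feature, (0, 3, 7, 8), 2, 2)
print(pooled)  # [[43, 46], [59, 62]]
```

The RoI here is 7 wide and 5 high, so the 2x2 cells cannot have integer sizes; this quantization is exactly what RoI Align avoids.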
ROI Align
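RoI Align converts the features of a region of interest (RoI) of any size into a small feature map with a fixed size H×W. The RoI is first projected onto the feature map, giving an h×w window. That window is divided into an H×W grid of sub-windows, each of size h/H × w/W, and the output value of each sub-window is computed from points sampled by bilinear interpolation rather than from quantized cells.
Step 1: take a fixed-size 8x8 feature map as input.
Step 2: project the RoI onto the feature map; its (top-left, bottom-right) coordinates are (0,3) and (7,8).
Step 3: sample four points in each grid cell; the red x's in the figure mark their positions.
Step 4: compute the value at each of the four red x's in a cell by bilinear interpolation. (For the details of bilinear interpolation, see the article "Explaining the role of “RoIAlign” in Mask-RCNN / the bilinear interpolation method".)
Step 5: average the four values in each cell to obtain the ROI Align result.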
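The same 8x8 example can be worked through for RoI Align. The sketch below is a pure-Python illustration, not torchvision's `roi_align`: the names `bilinear` and `roi_align_2d` are hypothetical, coordinates are used as continuous (x, y) positions with no half-pixel shift, and sample points that fall past the last row or column are clamped to the border (an assumption made to keep the example short).

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at the continuous point (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1)  # clamp to the map (assumption)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    top = feature[y0][x0] * (1 - lx) + feature[y0][x1] * lx
    bottom = feature[y1][x0] * (1 - lx) + feature[y1][x1] * lx
    return top * (1 - ly) + bottom * ly

def roi_align_2d(feature, roi, out_h, out_w, sampling_ratio=2):
    """Average sampling_ratio x sampling_ratio bilinear samples per grid cell."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h  # cell sizes stay fractional: no quantization
    bin_w = (x2 - x1) / out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(sampling_ratio):
                y = y1 + i * bin_h + (sy + 0.5) * bin_h / sampling_ratio
                for sx in range(sampling_ratio):
                    x = x1 + j * bin_w + (sx + 0.5) * bin_w / sampling_ratio
                    total += bilinear(feature, y, x)
            row.append(total / sampling_ratio ** 2)
        out.append(row)
    return out

feature = [[r * 8 + c for c in range(8)] for r in range(8)]
aligned = roi_align_2d(feature, (0, 3, 7, 8), 2, 2)
print(aligned)  # [[35.75, 39.25], [54.25, 57.75]]
```

With `sampling_ratio=2` each cell uses 2x2 = 4 sample points, matching step 3. The cells are 2.5 rows by 3.5 columns, and no coordinate is rounded along the way.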
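Code walkthrough
The code is based on the Faster R-CNN implementation by Bilibili uploader 霹雳吧啦Wz. The complete open-source code used in this article is available at the link (Faster R-CNN). The code added for this article is mainly in network_files/roi_function.py.
Usage
To switch between ROI Pooling and ROI Align, modify the following code in network_files/faster_rcnn_framework.py.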
class FasterRCNN(FasterRCNNBase):
    ....
        # Multi-scale RoIAlign pooling
        if box_roi_pool is None:
            box_roi_pool = MultiScaleRoIAlign(
                featmap_names=['0', '1', '2', '3'],  # feature levels on which RoI Align is applied
                output_size=[7, 7],
                sampling_ratio=2)
            # box_roi_pool = MultiScaleRoIPooling(
            #     featmap_names=['0', '1', '2', '3'],  # feature levels on which RoI Pooling is applied
            #     output_size=[7, 7]
            # )
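MultiScaleRoIOperation code walkthrough
network_files/roi_function.py defines three classes: the parent class MultiScaleRoIOperation and two subclasses, MultiScaleRoIPooling and MultiScaleRoIAlign. The two subclasses differ only in whether they call roi_pool or roi_align; every other processing step is identical, so this article focuses on the implementation of MultiScaleRoIOperation.
Initialization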
import torch
from torch import nn
from torchvision.ops import roi_align, roi_pool


class MultiScaleRoIOperation(nn.Module):
    def __init__(self, featmap_names, output_size, sampling_ratio=None, method='align'):
        super(MultiScaleRoIOperation, self).__init__()
        assert all(isinstance(featmap_name, str) for featmap_name in featmap_names), \
            'featmap_name must be a str type.'
        self.method = method
        if method == 'align':
            # sampling_ratio is only required for RoI Align
            assert type(sampling_ratio) in [int, float], 'sampling_ratio must be int or float when method is align'
        self.sampling_ratio = sampling_ratio
        self.featmap_names = featmap_names
        self.output_size = output_size
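The input parameters are:
featmap_names=['0', '1', '2', '3']  # feature levels on which RoI Align is applied
# '0' is P2, '1' is P3, '2' is P4, '3' is P5, and 'pool' is P6 in the FPN figure
# Figure from the FPN walkthrough by Bilibili uploader 霹雳吧啦Wz (https://www.bilibili.com/video/BV1dh411U7D9)
output_size=[7, 7]  # size of the output grid (7x7)
sampling_ratio=2  # RoI Align only: sampling points per grid cell along each axis (2 gives 2x2 = 4 points)
Forward pass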
    def forward(self, features, rois, image_shape):
        """
        Arguments:
            features (Dict[str, Tensor]): FPN feature maps
            rois (List[Tensor[N, 4]]): proposal boxes, one tensor per image
            image_shape (List[Tuple[H, W]]): size of each image in the batch
        Returns:
            Tensor:
                Pooled features
        """
        filtered_features = self._filter_inputs(features, self.featmap_names)
        rois = self._convert_to_roi_format(rois)
        scales = self._setup_scales(filtered_features, image_shape)
        # FPN levels covered by the selected feature maps (2, 3, 4, 5)
        lvl_min = -torch.log2(torch.tensor(scales[0], dtype=torch.float32)).item()
        lvl_max = -torch.log2(torch.tensor(scales[-1], dtype=torch.float32)).item()
        # Assign each RoI to an FPN level with the formula from the FPN paper
        # (https://arxiv.org/pdf/1612.03144v2):
        # k=\lfloor k_0+\log_2(\sqrt{wh}/224)\rfloor
        k0 = 4
        # rois columns are (batch_idx, x1, y1, x2, y2), so this is height * width
        box_area = (rois[:, 4] - rois[:, 2]) * (rois[:, 3] - rois[:, 1])
        k = torch.floor(k0 + torch.log2(torch.sqrt(box_area) / 224))
        # Clamp to the available levels, then shift so that k indexes filtered_features
        k = torch.clamp(k, lvl_min, lvl_max) - lvl_min
        # Output buffer, e.g. shape [1024, 256, 7, 7]
        all_level_pooled_feature = torch.zeros([len(rois), filtered_features[0].shape[1], *self.output_size],
                                               dtype=filtered_features[0].dtype, device=rois.device)
        # Pool the RoIs of each level with torchvision.ops.roi_align or roi_pool
        for idx, (featmap_in_level, scale) in enumerate(zip(filtered_features, scales)):
            mask_in_level = (k == idx)
            rois_in_level = rois[mask_in_level]
            if self.method == 'pooling':
                pooled_feature = roi_pool(featmap_in_level, rois_in_level, output_size=self.output_size,
                                          spatial_scale=scale)
            elif self.method == 'align':
                pooled_feature = roi_align(featmap_in_level, rois_in_level, output_size=self.output_size,
                                           spatial_scale=scale, sampling_ratio=self.sampling_ratio)
            else:
                raise ValueError("method must be 'pooling' or 'align'")
            all_level_pooled_feature[mask_in_level] = pooled_feature
        return all_level_pooled_feature
Typical inputs (for batch_size=2):
features={
    '0': Tensor_shape:(2, 256, 200, 304)
    '1': Tensor_shape:(2, 256, 100, 152)
    '2': Tensor_shape:(2, 256, 50, 76)
    '3': Tensor_shape:(2, 256, 25, 38)
    'pool': Tensor_shape:(2, 256, 13, 19)
}
image_shapes=[(800, 1201), (800, 1066)]
rois=[Tensor_shape:(512, 4), Tensor_shape:(512, 4)]
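The level assignment used in forward, k = floor(k0 + log2(sqrt(wh) / 224)) followed by clamping to the available levels, can be checked by hand with a small pure-Python sketch. The function name `fpn_level_index` is hypothetical and not part of the repository; the defaults assume the four selected levels P2 to P5 (lvl_min=2, lvl_max=5) and k0 = 4 as in the code above.

```python
import math

def fpn_level_index(width, height, k0=4, lvl_min=2, lvl_max=5, canonical=224):
    """Return the index into the selected feature maps for a box of this size."""
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical))
    k = min(max(k, lvl_min), lvl_max)  # clamp to the levels that exist
    return k - lvl_min                 # 0 -> P2, 1 -> P3, 2 -> P4, 3 -> P5

for w, h in [(32, 32), (112, 112), (224, 224), (448, 448)]:
    print((w, h), "-> feature map index", fpn_level_index(w, h))
# (32, 32) -> 0, (112, 112) -> 1, (224, 224) -> 2, (448, 448) -> 3
```

A 224x224 box lands on level k0 = 4 (index 2); each halving of the box side moves it one level finer, and anything outside the range is clamped to the nearest available level.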
Helper functions
    @staticmethod
    def _filter_inputs(x, featmap_names):
        '''
        Select the entries of the input OrderedDict x whose keys are in featmap_names; returns list[Tensor].
        '''
        x_filtered = []
        for k, v in x.items():
            if k in featmap_names:
                x_filtered.append(v)
        return x_filtered

    @staticmethod
    def _convert_to_roi_format(rois):
        '''
        Convert rois from [Tensor_shape:(512, 4), Tensor_shape:(512, 4)] to Tensor_shape[1024, 5].
        rois[:, 0] holds the batch index of each box: [0, 0, 0, ...., 1, 1, 1].
        '''
        rois = [
            # dtype=roi.dtype keeps the index column in the same dtype as the boxes for torch.cat
            torch.cat([torch.full((roi.shape[0], 1), batch_idx, dtype=roi.dtype, device=roi.device), roi], dim=1)
            for batch_idx, roi in enumerate(rois)]
        rois = torch.cat(rois, dim=0)
        return rois

    @staticmethod
    def _get_scales(feature, image_size):
        '''
        Compute the scale of the feature map relative to image_size, rounded to a power of two, e.g. 2^(-2).
        Only the height-based scale is returned.
        '''
        size = feature.shape[-2:]
        scales = []
        for s1, s2 in zip(size, image_size):
            scale = float(s1) / float(s2)
            scale = 2 ** (float(torch.tensor(scale).log2().round()))
            scales.append(scale)
        return scales[0]

    def _setup_scales(self, features, image_shapes):
        '''
        Compute the scale of each feature map relative to the largest height and width in image_shapes,
        e.g. 2^(-2), 2^(-3), 2^(-4), 2^(-5).
        '''
        max_w = 0
        max_h = 0
        for shape in image_shapes:
            max_h = max(shape[0], max_h)
            max_w = max(shape[1], max_w)
        input_size = (max_h, max_w)
        scales = [self._get_scales(feat, input_size) for feat in features]
        return scales
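To see what the helpers produce without running PyTorch, the box flattening and the scale computation can be mimicked in pure Python. This is a sketch under stated assumptions: `to_roi_format`, `infer_scale` and `setup_scales` are hypothetical stand-ins that work on plain lists and tuples rather than tensors, and the feature-map sizes are taken from the typical inputs listed earlier.

```python
import math

def to_roi_format(rois_per_image):
    """Flatten per-image box lists into rows of (batch_idx, x1, y1, x2, y2)."""
    return [
        [float(batch_idx), *box]
        for batch_idx, boxes in enumerate(rois_per_image)
        for box in boxes
    ]

def infer_scale(feature_hw, image_hw):
    """Snap feature_height / image_height to the nearest power of two."""
    ratio = feature_hw[0] / image_hw[0]
    return 2.0 ** round(math.log2(ratio))

def setup_scales(feature_sizes, image_shapes):
    """One scale per feature map, relative to the largest image in the batch."""
    max_h = max(shape[0] for shape in image_shapes)
    max_w = max(shape[1] for shape in image_shapes)
    return [infer_scale(size, (max_h, max_w)) for size in feature_sizes]

rois = to_roi_format([[(10.0, 20.0, 110.0, 220.0)], [(5.0, 5.0, 50.0, 60.0)]])
scales = setup_scales([(200, 304), (100, 152), (50, 76), (25, 38)],
                      [(800, 1201), (800, 1066)])
print(rois)    # [[0.0, 10.0, 20.0, 110.0, 220.0], [1.0, 5.0, 5.0, 50.0, 60.0]]
print(scales)  # [0.25, 0.125, 0.0625, 0.03125]
```

The scales 2^(-2) to 2^(-5) are what forward converts back into the level range 2 to 5 via -log2(scale).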
MultiScaleRoIPooling code walkthrough
This class sets method='pooling' and calls the forward method of the parent class MultiScaleRoIOperation.
class MultiScaleRoIPooling(MultiScaleRoIOperation):
    def __init__(self, featmap_names, output_size):
        super(MultiScaleRoIPooling, self).__init__(featmap_names, output_size, method='pooling')

    def forward(self, features, rois, image_shape):
        """
        Arguments:
            features (Dict[str, Tensor]): FPN feature maps
            rois (List[Tensor[N, 4]]): proposal boxes, one tensor per image
            image_shape (List[Tuple[H, W]]): size of each image in the batch
        Returns:
            Tensor:
                Pooled features
        """
        return super().forward(features, rois, image_shape)
MultiScaleRoIAlign code walkthrough
This class sets method='align' and calls the forward method of the parent class MultiScaleRoIOperation.
class MultiScaleRoIAlign(MultiScaleRoIOperation):
    def __init__(self, featmap_names, output_size, sampling_ratio):
        super(MultiScaleRoIAlign, self).__init__(featmap_names, output_size,
                                                 sampling_ratio=sampling_ratio, method='align')

    def forward(self, features, rois, image_shape):
        """
        Arguments:
            features (Dict[str, Tensor]): FPN feature maps
            rois (List[Tensor[N, 4]]): proposal boxes, one tensor per image
            image_shape (List[Tuple[H, W]]): size of each image in the batch
        Returns:
            Tensor:
                aligned features
        """
        return super().forward(features, rois, image_shape)
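Results
The experiments were run on a machine with four NVIDIA GeForce RTX 3080 GPUs. The fasterrcnn_resnet50_fpn_coco.pth weights were loaded and fine-tuned on the Pascal VOC2007+2012 train+val sets, with evaluation on the Pascal VOC2007 test set. The results below were obtained after fine-tuning through epoch 10. The training arguments were: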
Namespace(data_path='../../PASCAL_VOC_2012/data/', device='cuda', num_classes=20, batch_size=8, start_epoch=0, epochs=20, workers=8, lr=0.02, momentum=0.9, weight_decay=0.0001, lr_step_size=8, lr_steps=[7, 12], lr_gamma=0.1, print_freq=20, output_dir='./multi_train', resume='', aspect_ratio_group_factor=3, test_only=False, world_size=4, dist_url='env://', sync_bn=True, amp='True', rank=0, gpu=0, distributed=True, dist_backend='nccl')
The COCO metrics on the validation set are:
ROI Pooling
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.569
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.867
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.648
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.304
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.453
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.613
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.470
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.663
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.670
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.456
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.574
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.705
ROI Align
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.582
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.870
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.651
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.333
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.465
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.625
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.479
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.678
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.686
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.478
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.589
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.719
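To make the two listings easier to compare, the AP rows can be differenced. The snippet below only copies the AP values from the listings above; the dictionary keys and variable names are illustrative.

```python
# AP values copied from the two listings above (IoU=0.50:0.95 unless noted).
roi_pooling = {"AP": 0.569, "AP50": 0.867, "AP75": 0.648,
               "AP_small": 0.304, "AP_medium": 0.453, "AP_large": 0.613}
roi_align = {"AP": 0.582, "AP50": 0.870, "AP75": 0.651,
             "AP_small": 0.333, "AP_medium": 0.465, "AP_large": 0.625}

# Gain of ROI Align over ROI Pooling for each metric.
gains = {name: round(roi_align[name] - roi_pooling[name], 3) for name in roi_pooling}
for name, gain in gains.items():
    print(f"{name:10s} {roi_pooling[name]:.3f} -> {roi_align[name]:.3f} ({gain:+.3f})")
```

In this run ROI Align is ahead on every AP row, with the largest gain on small objects (+0.029) and an overall AP gain of +0.013.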
The complete code for this article is available at the link (Faster R-CNN).
References
https://blog.csdn.net/weixin_42782150/article/details/110946903
https://blog.csdn.net/Bit_Coders/article/details/121203584
https://blog.csdn.net/qq_42902997/article/details/105087407