torchvision Faster-RCNN ResNet-50 FPN code walkthrough (image transforms and coordinates)

王飞95

Categories: torchvision, notes, torch. Tags: computer vision, image recognition

Article link: https://blog.csdn.net/defi_wang/article/details/108920448

Image transforms

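The constructor in torchvision\models\detection\faster_rcnn.py specifies the image mean/std. Earlier notes explained why, so I won't repeat it here. The constructor also sets the minimum and maximum image size, here 800 and 1333: the shorter side is scaled to 800 unless that would push the longer side past 1333, so the transformed image never exceeds 800x1333 or 1333x800.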

class FasterRCNN(GeneralizedRCNN):
    def __init__(self, backbone, num_classes=None,
                 # transform parameters
                 min_size=800, max_size=1333,
                 ......
        ......
        if image_mean is None:
            image_mean = [0.485, 0.456, 0.406]
        if image_std is None:
            image_std = [0.229, 0.224, 0.225]
        transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)
        ......

GeneralizedRCNNTransform

Before data is fed to the network, this transform module converts the input images and the targets.

class GeneralizedRCNNTransform(nn.Module):
    def forward(self,
                images,       # type: List[Tensor]
                targets=None  # type: Optional[List[Dict[str, Tensor]]]
                ):
        ......
        for i in range(len(images)):
            ......
            image = self.normalize(image)
            image, target_index = self.resize(image, target_index)
            ......
        image_sizes = [img.shape[-2:] for img in images]
        images = self.batch_images(images)
        ......

Normalize

The input is a list of images and the output is the transformed image tensors. Each image is first normalized:

    def normalize(self, image):
        dtype, device = image.dtype, image.device
        mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)
        std = torch.as_tensor(self.image_std, dtype=dtype, device=device)
        return (image - mean[:, None, None]) / std[:, None, None]

The pixel values of the resulting tensor are now distributed around 0, which is easier for the neural network to process.
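
To make the effect of normalization concrete, here is a minimal pure-Python sketch applied to single RGB pixels. The helper name `normalize_pixel` is hypothetical, and the sketch assumes the input channels are already scaled to the [0, 1] range. The mean/std values are the ones quoted from the constructor above.

```python
# Mean and std quoted from the FasterRCNN constructor above.
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Hypothetical helper: normalize one RGB pixel with channels in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD)]


black = normalize_pixel([0.0, 0.0, 0.0])  # all channels become negative
gray = normalize_pixel([0.5, 0.5, 0.5])   # all channels land close to 0
white = normalize_pixel([1.0, 1.0, 1.0])  # all channels become positive
print(black, gray, white)
```

The real `normalize` does the same arithmetic on whole tensors, broadcasting the per-channel mean and std over height and width.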

Resize

This step first works out whether the scale factor is determined by the shorter or the longer side of the image.

def _resize_image_and_masks(image, self_min_size, self_max_size, target):
    # type: (Tensor, float, float, Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]
    im_shape = torch.tensor(image.shape[-2:])
    min_size = float(torch.min(im_shape))
    max_size = float(torch.max(im_shape))
    scale_factor = self_min_size / min_size
    if max_size * scale_factor > self_max_size:
        scale_factor = self_max_size / max_size
    image = torch.nn.functional.interpolate(
        image[None], scale_factor=scale_factor, mode='bilinear', recompute_scale_factor=True,
        align_corners=False)[0]

    if target is None:
        return image, target
    ......
    return image, target

class GeneralizedRCNNTransform(nn.Module):
    ......
    def resize(self, image, target):
        # type: (Tensor, Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]
        h, w = image.shape[-2:]
        if self.training:
            size = float(self.torch_choice(self.min_size))
        else:
            # FIXME assume for now that testing uses the largest scale
            size = float(self.min_size[-1])
        if torchvision._is_tracing():
            image, target = _resize_image_and_masks_onnx(image, size, float(self.max_size), target)
        else:
            image, target = _resize_image_and_masks(image, size, float(self.max_size), target)

First get the image height and width (599, 900):

    im_shape = torch.tensor(image.shape[-2:])

Then get the smaller and the larger of the two sides (599 and 900):

    min_size = float(torch.min(im_shape))
    max_size = float(torch.max(im_shape))

From the minimum-size requirement, compute the scale factor (800/599 ≈ 1.33556):

    scale_factor = self_min_size / min_size

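Check whether, with this scale factor, the longer side (900*800/599 ≈ 1202.003) exceeds the configured maximum (1333). If it does, the scale factor is recomputed from the longer side. Here it does not, so this step is skipped: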

    if max_size * scale_factor > self_max_size:
        scale_factor = self_max_size / max_size

Use PyTorch's interpolate to rescale with bilinear interpolation, which yields a (3, 800, 1202) image tensor:

    image = torch.nn.functional.interpolate(
        image[None], scale_factor=scale_factor, mode='bilinear', recompute_scale_factor=True,
        align_corners=False)[0]
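
The scale-factor logic walked through above can be isolated into a few lines of plain Python. This is only a sketch: the function name `compute_scale_factor` and the second test image (500x1000) are illustrative assumptions, and the actual interpolation is left out.

```python
def compute_scale_factor(height, width, min_size=800.0, max_size=1333.0):
    """Hypothetical helper mirroring the scale logic of _resize_image_and_masks."""
    short_side = float(min(height, width))
    long_side = float(max(height, width))
    # Scale so that the shorter side reaches min_size ...
    scale = min_size / short_side
    # ... unless that would push the longer side past max_size.
    if long_side * scale > max_size:
        scale = max_size / long_side
    return scale


scale = compute_scale_factor(599, 900)     # shorter side drives the scale
capped = compute_scale_factor(500, 1000)   # longer side would reach 1600, so it is capped
print(scale, 900 * scale)
print(capped, 500 * capped)
```

For the 599x900 example the longer side ends up at about 1202, below 1333, so the first branch wins. For the assumed 500x1000 image the cap applies and the shorter side ends up below 800.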

batch_images

Get the per-dimension maximum across all images (e.g. 3, 800, 1202) and the configured stride (e.g. 32):

    def batch_images(self, images, size_divisible=32):
        # type: (List[Tensor], int) -> Tensor
        if torchvision._is_tracing():
            # batch_images() does not export well to ONNX
            # call _onnx_batch_images() instead
            return self._onnx_batch_images(images, size_divisible)

        max_size = self.max_by_axis([list(img.shape) for img in images])
        stride = float(size_divisible)

Align the height and width to the stride: 800 is divisible by 32 and stays unchanged, while 1202 is rounded up to 1216:

        max_size[1] = int(math.ceil(float(max_size[1]) / stride) * stride)
        max_size[2] = int(math.ceil(float(max_size[2]) / stride) * stride)

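Allocate a zero-filled batch tensor and copy each image into its top-left corner, so the padding ends up on the right and at the bottom. The images are not resized again. Because the images are already normalized, the padding value 0 corresponds to the mean color rather than black. Each image now occupies a (3, 800, 1216) slot in the batch.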

        batch_shape = [len(images)] + max_size
        batched_imgs = images[0].new_full(batch_shape, 0)
        for img, pad_img in zip(images, batched_imgs):
            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
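
The shape arithmetic of `batch_images` can be checked without any tensors. The sketch below uses a hypothetical helper `batch_shape` that takes only the (C, H, W) shapes. The second image shape, (3, 666, 1333), is an assumed example.

```python
import math


def batch_shape(shapes, size_divisible=32):
    """Hypothetical helper: padded batch shape for a list of (C, H, W) shapes."""
    # Per-axis maximum over all images, as max_by_axis does.
    max_size = [max(dims) for dims in zip(*shapes)]
    # Round height and width up to the next multiple of the stride.
    max_size[1] = int(math.ceil(max_size[1] / size_divisible) * size_divisible)
    max_size[2] = int(math.ceil(max_size[2] / size_divisible) * size_divisible)
    return [len(shapes)] + max_size


single = batch_shape([(3, 800, 1202)])
mixed = batch_shape([(3, 800, 1202), (3, 666, 1333)])
print(single)  # [1, 3, 800, 1216]
print(mixed)   # [2, 3, 800, 1344]
```

With more than one image in the batch, every image is padded to the largest height and width in the batch, so smaller images receive more padding.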

Coordinates

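This is why three image sizes appear in the code:
original_image_sizes: (599, 900); the final bboxes are converted back to these coordinates
image_sizes: (800, 1202); the generated bboxes and the like are all based on these coordinates
feature_maps: based on (800, 1216); the feature map size at each level is then: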

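name	size
input image	800x1216
conv1/maxpool	200x304
conv2_x (FPN level '0')	200x304
conv3_x (FPN level '1')	100x152
conv4_x (FPN level '2')	50x76
conv5_x (FPN level '3')	25x38
extra max pool (FPN level 'pool')	13x19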
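
The sizes can be reproduced by repeated halving. The sketch below rests on my own reading of the architecture rather than on anything stated in this note: every stride-2 layer in the ResNet-50 FPN backbone maps a side of length n to ceil(n/2), and the FPN outputs are keyed '0' to '3' plus 'pool'. The helper names are hypothetical.

```python
def halve(n):
    """Output length of a stride-2 layer, assumed to be ceil(n / 2)."""
    return (n - 1) // 2 + 1


def fpn_level_sizes(height, width):
    """Hypothetical helper: spatial size of each FPN level for a padded input."""
    sizes = {}
    # conv1 (stride 2) followed by maxpool (stride 2): overall stride 4.
    h, w = halve(halve(height)), halve(halve(width))
    for key in ["0", "1", "2", "3"]:  # overall strides 4, 8, 16, 32
        sizes[key] = (h, w)
        h, w = halve(h), halve(w)
    sizes["pool"] = (h, w)  # extra stride-2 max pool on the coarsest level
    return sizes


levels = fpn_level_sizes(800, 1216)
print(levels)
```

Because the padded input is a multiple of 32, the first four levels divide evenly. Only the last pooling step rounds up (25 to 13).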

In the RoI heads, the predicted boxes are finally clipped to the region given by image_sizes.

def clip_boxes_to_image(boxes, size):
    dim = boxes.dim()
    boxes_x = boxes[..., 0::2]
    boxes_y = boxes[..., 1::2]
    height, width = size

    if torchvision._is_tracing():
        boxes_x = torch.max(boxes_x, torch.tensor(0, dtype=boxes.dtype, device=boxes.device))
        boxes_x = torch.min(boxes_x, torch.tensor(width, dtype=boxes.dtype, device=boxes.device))
        boxes_y = torch.max(boxes_y, torch.tensor(0, dtype=boxes.dtype, device=boxes.device))
        boxes_y = torch.min(boxes_y, torch.tensor(height, dtype=boxes.dtype, device=boxes.device))
    else:
        boxes_x = boxes_x.clamp(min=0, max=width)
        boxes_y = boxes_y.clamp(min=0, max=height)

    clipped_boxes = torch.stack((boxes_x, boxes_y), dim=dim)
    return clipped_boxes.reshape(boxes.shape)

class RoIHeads(torch.nn.Module):
    def postprocess_detections(self,
                               class_logits,    # type: Tensor
                               box_regression,  # type: Tensor
                               proposals,       # type: List[Tensor]
                               image_shapes     # type: List[Tuple[int, int]]
                               ):
        ......
        all_boxes = []
        all_scores = []
        all_labels = []
        for boxes, scores, image_shape in zip(pred_boxes_list, pred_scores_list, image_shapes):
            boxes = box_ops.clip_boxes_to_image(boxes, image_shape) 
        ......
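
As a minimal illustration of the clipping, here is a pure-Python sketch for a single box in (x1, y1, x2, y2) format. The helper `clip_box_to_image` is hypothetical and handles one box only, whereas the real function works on whole tensors.

```python
def clip_box_to_image(box, size):
    """Hypothetical helper: clamp one (x1, y1, x2, y2) box to a (height, width) image."""
    height, width = size
    x1, y1, x2, y2 = box

    def clamp(value, upper):
        return min(max(value, 0.0), float(upper))

    # x coordinates are limited by the width, y coordinates by the height.
    return [clamp(x1, width), clamp(y1, height), clamp(x2, width), clamp(y2, height)]


clipped = clip_box_to_image([-5.0, 10.0, 1250.0, 820.0], (800, 1202))
print(clipped)  # [0.0, 10.0, 1202.0, 800.0]
```

Note that the limit is image_sizes (800, 1202), not the padded size (800, 1216), so boxes cannot extend into the padding.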

Finally, the coordinates are converted from image_sizes back to original_image_sizes. This happens in transform.postprocess, which maps image_sizes (800, 1202) to original_image_sizes (599, 900):

class GeneralizedRCNN(nn.Module):
    def forward(self, images, targets=None):
        ......
        features = self.backbone(images.tensors)
        if isinstance(features, torch.Tensor):
            features = OrderedDict([('0', features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
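
The coordinate conversion amounts to multiplying x by the width ratio and y by the height ratio. The sketch below shows this for a single box. The helper `resize_box` is a hypothetical stand-in, not the actual postprocess code.

```python
def resize_box(box, from_size, to_size):
    """Hypothetical helper: map one (x1, y1, x2, y2) box between (height, width) sizes."""
    ratio_h = to_size[0] / from_size[0]
    ratio_w = to_size[1] / from_size[1]
    x1, y1, x2, y2 = box
    return [x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h]


# A box covering the whole resized image maps back to the whole original image.
restored = resize_box([0.0, 0.0, 1202.0, 800.0], (800, 1202), (599, 900))
print(restored)
```

The two ratios differ slightly (599/800 versus 900/1202) because the resized width was rounded down to 1202, which is why height and width are handled separately.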
