Reading the torchvision Source Code: Faster-RCNN

Overview

torchvision is a well-known computer vision library written in PyTorch. This post walks through the classic object detection algorithm Faster-RCNN. The Faster-RCNN in torchvision uses ResNet-50 as its feature extraction network and additionally incorporates a later research result, the FPN (Feature Pyramid Network). We will roughly follow the contents of the fasterrcnn_resnet50_fpn function as our route through the source code.

Feature Extraction Network

[Figure: ResNet-50 architecture table]
The feature extraction network we use is shown above. The Bottleneck blocks are defined in the table: taking Bottleneck1 as an example, it is formed by stacking the group of 64 $1\times1$, 64 $3\times3$, and 256 $1\times1$ convolution kernels three times. The ResNet-50 source code can be found at this link.
Here self.inplanes = 64, meaning the first convolution layer has 64 $7\times7$ kernels, and so on. In the source code, Bottleneck1~Bottleneck4 are named layer1~layer4, respectively:

        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

The most important function here is _make_layer(), which builds the various Bottleneck stages shown in the second figure; each stage is a fixed group of convolution layers stacked repeatedly. Its source is as follows:

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

Finally, note that self.inplanes = planes * block.expansion, i.e. $512 \times 4 = 2048$ after the last stage.
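To verify this, here is a minimal sketch (assuming a standard torchvision install) that inspects a freshly constructed resnet50:

import torchvision

m = torchvision.models.resnet50()
print(m.inplanes)                          # 2048 == 512 * Bottleneck.expansion
print(len(m.layer2))                       # 4 Bottlenecks stacked in layer2
print(m.layer2[0].downsample is not None)  # True: the first block carries the 1x1 downsample
print(m.layer4[-1].conv3.out_channels)     # 2048: output channels of the last block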
Next, the feature pyramid is implemented mainly by BackboneWithFPN and FeaturePyramidNetwork, whose main source code is:

class BackboneWithFPN(nn.Module):
    """
    Adds a FPN on top of a model.
    Internally, it uses torchvision.models._utils.IntermediateLayerGetter to
    extract a submodel that returns the feature maps specified in return_layers.
    The same limitations of IntermediateLayerGetter apply here.
    Arguments:
        backbone (nn.Module)
        return_layers (Dict[name, new_name]): a dict containing the names
            of the modules for which the activations will be returned as
            the key of the dict, and the value of the dict is the name
            of the returned activation (which the user can specify).
        in_channels_list (List[int]): number of channels for each feature map
            that is returned, in the order they are present in the OrderedDict
        out_channels (int): number of channels in the FPN.
    Attributes:
        out_channels (int): the number of channels in the FPN
    """
    def __init__(self, backbone, return_layers, in_channels_list, out_channels):
        super(BackboneWithFPN, self).__init__()
        # IntermediateLayerGetter extracts the layers named in return_layers.
        # E.g. with return_layers = {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'} and a resnet50 backbone,
        # it extracts the layers named layer1, layer2, layer3, layer4 from resnet50 and renames them 0, 1, 2, 3
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
        self.fpn = FeaturePyramidNetwork(
            in_channels_list=in_channels_list,
            out_channels=out_channels,
            extra_blocks=LastLevelMaxPool(),
        )
        self.out_channels = out_channels

    def forward(self, x):
        x = self.body(x)
        x = self.fpn(x)
        return x


def resnet_fpn_backbone(backbone_name, pretrained, norm_layer=misc_nn_ops.FrozenBatchNorm2d, trainable_layers=3):
    """
    Constructs a specified ResNet backbone with FPN on top. Freezes the specified number of layers in the backbone.

    Examples::

        >>> from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
        >>> backbone = resnet_fpn_backbone('resnet50', pretrained=True, trainable_layers=3)
        >>> # get some dummy image
        >>> x = torch.rand(1,3,64,64)
        >>> # compute the output
        >>> output = backbone(x)
        >>> print([(k, v.shape) for k, v in output.items()])
        >>> # returns
        >>>   [('0', torch.Size([1, 256, 16, 16])),
        >>>    ('1', torch.Size([1, 256, 8, 8])),
        >>>    ('2', torch.Size([1, 256, 4, 4])),
        >>>    ('3', torch.Size([1, 256, 2, 2])),
        >>>    ('pool', torch.Size([1, 256, 1, 1]))]

    Arguments:
        backbone_name (string): resnet architecture. Possible values are 'ResNet', 'resnet18', 'resnet34', 'resnet50',
             'resnet101', 'resnet152', 'resnext50_32x4d', 'resnext101_32x8d', 'wide_resnet50_2', 'wide_resnet101_2'
        norm_layer (torchvision.ops): it is recommended to use the default value. For details visit:
            (https://github.com/facebookresearch/maskrcnn-benchmark/issues/267)
        pretrained (bool): If True, returns a model with backbone pre-trained on Imagenet
        trainable_layers (int): number of trainable (not frozen) resnet layers starting from final block.
            Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
    """
    backbone = resnet.__dict__[backbone_name](
        pretrained=pretrained,
        norm_layer=norm_layer)
    # select layers that won't be frozen
    assert trainable_layers <= 5 and trainable_layers >= 0
    layers_to_train = ['layer4', 'layer3', 'layer2', 'layer1', 'conv1'][:trainable_layers]
    # freeze layers only if pretrained backbone is used
    for name, parameter in backbone.named_parameters():
        if all([not name.startswith(layer) for layer in layers_to_train]):
            parameter.requires_grad_(False)

    return_layers = {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'}

    in_channels_stage2 = backbone.inplanes // 8
    in_channels_list = [
        in_channels_stage2,
        in_channels_stage2 * 2,
        in_channels_stage2 * 4,
        in_channels_stage2 * 8,
    ]
    out_channels = 256
    return BackboneWithFPN(backbone, return_layers, in_channels_list, out_channels)
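The freezing logic above can be illustrated with a short sketch (assuming a standard torchvision install); with trainable_layers=3, only layer2 through layer4 keep their gradients:

import torchvision

backbone = torchvision.models.resnet50()
layers_to_train = ['layer4', 'layer3', 'layer2', 'layer1', 'conv1'][:3]
for name, parameter in backbone.named_parameters():
    # freeze every parameter whose name does not start with a trainable layer
    parameter.requires_grad_(any(name.startswith(layer) for layer in layers_to_train))

trainable = {name.split('.')[0] for name, p in backbone.named_parameters() if p.requires_grad}
print(sorted(trainable))  # ['layer2', 'layer3', 'layer4']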

BackboneWithFPN mainly uses IntermediateLayerGetter and FeaturePyramidNetwork. The former extracts the layers whose outputs are fed into the feature pyramid; its code is below. The module inherits from the important built-in class nn.ModuleDict, which holds an ordered dictionary of (string: module) pairs. named_children() returns an iterator over the immediate submodules, yielding both each module's name and the module itself; see here for its usage.
return_layers = {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'} stores the modules' original names and the new names they should be given. The forward pass finally returns an OrderedDict of the requested feature maps.
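As a quick illustration of named_children(), which the constructor below iterates over (a sketch assuming a standard torchvision install):

import torchvision

m = torchvision.models.resnet50()
# named_children() yields (name, module) pairs for direct submodules only
print([name for name, _ in m.named_children()])
# ['conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool', 'fc']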

class IntermediateLayerGetter(nn.ModuleDict):
    """
    Module wrapper that returns intermediate layers from a model

    It has a strong assumption that the modules have been registered
    into the model in the same order as they are used.
    This means that one should **not** reuse the same nn.Module
    twice in the forward if you want this to work.

    Additionally, it is only able to query submodules that are directly
    assigned to the model. So if `model` is passed, `model.feature1` can
    be returned, but not `model.feature1.layer2`.

    Arguments:
        model (nn.Module): model on which we will extract the features
        return_layers (Dict[name, new_name]): a dict containing the names
            of the modules for which the activations will be returned as
            the key of the dict, and the value of the dict is the name
            of the returned activation (which the user can specify).

    Examples::

        >>> m = torchvision.models.resnet18(pretrained=True)
        >>> # extract layer1 and layer3, giving as names `feat1` and `feat2`
        >>> new_m = torchvision.models._utils.IntermediateLayerGetter(m,
        >>>     {'layer1': 'feat1', 'layer3': 'feat2'})
        >>> out = new_m(torch.rand(1, 3, 224, 224))
        >>> print([(k, v.shape) for k, v in out.items()])
        >>>     [('feat1', torch.Size([1, 64, 56, 56])),
        >>>      ('feat2', torch.Size([1, 256, 14, 14]))]
    """
    _version = 2
    __annotations__ = {
        "return_layers": Dict[str, str],
    }

    def __init__(self, model, return_layers):
        if not set(return_layers).issubset([name for name, _ in model.named_children()]):
            raise ValueError("return_layers are not present in model")
        orig_return_layers = return_layers
        return_layers = {str(k): str(v) for k, v in return_layers.items()}
        layers = OrderedDict()
        for name, module in model.named_children():
            layers[name] = module
            if name in return_layers:
                del return_layers[name]
            if not return_layers:
                break

        super(IntermediateLayerGetter, self).__init__(layers)
        self.return_layers = orig_return_layers

    def forward(self, x):
        out = OrderedDict()
        for name, module in self.items():
            x = module(x)
            if name in self.return_layers:
                out_name = self.return_layers[name]
                out[out_name] = x
        return out

The latter, FeaturePyramidNetwork, implements the feature pyramid itself.
resnet_fpn_backbone fills in the input arguments for BackboneWithFPN and finally returns an instance of that class, completing the construction of the feature pyramid.
[Figure: in_channels_list — the channel counts of the four feature levels]
After the last call to _make_layer(), self.inplanes = 2048, i.e. backbone.inplanes = 2048 in resnet_fpn_backbone above. As the figure shows, in_channels_list holds the number of channels of each feature level. self.children() is used during initialization, a common initialization pattern; note how it differs from self.modules(), illustrated below.
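A small sketch of the difference: children() yields only the direct submodules, while modules() recurses into the whole tree (and includes the module itself):

import torch.nn as nn

m = nn.Sequential(nn.Conv2d(3, 8, 1), nn.Sequential(nn.ReLU(), nn.Conv2d(8, 8, 3)))
print([type(c).__name__ for c in m.children()])
# ['Conv2d', 'Sequential']
print([type(c).__name__ for c in m.modules()])
# ['Sequential', 'Conv2d', 'Sequential', 'ReLU', 'Conv2d']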

class FeaturePyramidNetwork(nn.Module):
    """
    Module that adds a FPN from on top of a set of feature maps. This is based on
    `"Feature Pyramid Network for Object Detection" <https://arxiv.org/abs/1612.03144>`_.

    The feature maps are currently supposed to be in increasing depth
    order.

    The input to the model is expected to be an OrderedDict[Tensor], containing
    the feature maps on top of which the FPN will be added.

    Arguments:
        in_channels_list (list[int]): number of channels for each feature map that
            is passed to the module
        out_channels (int): number of channels of the FPN representation
        extra_blocks (ExtraFPNBlock or None): if provided, extra operations will
            be performed. It is expected to take the fpn features, the original
            features and the names of the original features as input, and returns
            a new list of feature maps and their corresponding names

    Examples::

        >>> m = torchvision.ops.FeaturePyramidNetwork([10, 20, 30], 5)
        >>> # get some dummy data
        >>> x = OrderedDict()
        >>> x['feat0'] = torch.rand(1, 10, 64, 64)
        >>> x['feat2'] = torch.rand(1, 20, 16, 16)
        >>> x['feat3'] = torch.rand(1, 30, 8, 8)
        >>> # compute the FPN on top of x
        >>> output = m(x)
        >>> print([(k, v.shape) for k, v in output.items()])
        >>> # returns
        >>>   [('feat0', torch.Size([1, 5, 64, 64])),
        >>>    ('feat2', torch.Size([1, 5, 16, 16])),
        >>>    ('feat3', torch.Size([1, 5, 8, 8]))]

    """

    def __init__(self, in_channels_list, out_channels, extra_blocks=None):
        super(FeaturePyramidNetwork, self).__init__()
        self.inner_blocks = nn.ModuleList()
        self.layer_blocks = nn.ModuleList()
        for in_channels in in_channels_list:
            if in_channels == 0:
                continue
            # In the FPN, the 1x1 conv changes the number of output channels to 256, and the 3x3 conv
            # reduces the aliasing introduced by upsampling; the lateral and top-down features are
            # summed and become the levels of the feature pyramid
            inner_block_module = nn.Conv2d(in_channels, out_channels, 1) # 1x1 conv
            layer_block_module = nn.Conv2d(out_channels, out_channels, 3, padding=1) # 3x3 conv
            self.inner_blocks.append(inner_block_module)
            self.layer_blocks.append(layer_block_module)

        # initialize parameters now to avoid modifying the initialization of top_blocks
        for m in self.children():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_uniform_(m.weight, a=1)
                nn.init.constant_(m.bias, 0)

        if extra_blocks is not None:
            assert isinstance(extra_blocks, ExtraFPNBlock)
        self.extra_blocks = extra_blocks

    def forward(self, x):
        """
        Computes the FPN for a set of feature maps.

        Arguments:
            x (OrderedDict[Tensor]): feature maps for each feature level.

        Returns:
            results (OrderedDict[Tensor]): feature maps after FPN layers.
                They are ordered from highest resolution first.
        """
        # unpack OrderedDict into two lists for easier handling
        names = list(x.keys())
        x = list(x.values())

        last_inner = self.inner_blocks[-1](x[-1])
        results = []
        results.append(self.layer_blocks[-1](last_inner))
        for feature, inner_block, layer_block in zip(
            x[:-1][::-1], self.inner_blocks[:-1][::-1], self.layer_blocks[:-1][::-1]
        ):
            if not inner_block:
                continue
            inner_lateral = inner_block(feature)
            feat_shape = inner_lateral.shape[-2:]
            inner_top_down = F.interpolate(last_inner, size=feat_shape, mode="nearest")
            last_inner = inner_lateral + inner_top_down
            results.insert(0, layer_block(last_inner))

        if self.extra_blocks is not None:
            results, names = self.extra_blocks(results, x, names)

        # make it back an OrderedDict
        out = OrderedDict([(k, v) for k, v in zip(names, results)])

        return out

FeaturePyramidNetwork mainly performs the following operations: $1\times1$ conv, 2x upsampling, and $3\times3$ conv.
[Figure: FPN top-down pathway — lateral $1\times1$ convs, nearest-neighbor upsampling, $3\times3$ convs]
In forward(self, x), x is the ordered dictionary of submodule outputs obtained earlier. self.inner_blocks stores the $1\times1$ convs, self.layer_blocks stores the $3\times3$ convs, and F.interpolate(last_inner, size=feat_shape, mode="nearest") performs the upsampling. Because the last (deepest) feature level has no deeper level above it, it does not take part in the top-down summation and is handled separately before the loop; the loop then processes the remaining levels. Note the somewhat unusual slicing used here, which a short example makes clear:

In [1]: x = [1,2,3]
In [2]: x[:-1][::-1]
Out[2]: [2, 1]

The return value of forward() is the final output of the FeaturePyramidNetwork, which is eventually passed into the FasterRCNN class.
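For intuition, here is a minimal sketch of one top-down merge step from the forward pass above (dummy tensors; shapes chosen arbitrarily):

import torch
import torch.nn.functional as F

inner_lateral = torch.rand(1, 256, 16, 16)   # output of an inner_block (1x1 conv)
last_inner = torch.rand(1, 256, 8, 8)        # running top-down feature from the deeper level
inner_top_down = F.interpolate(last_inner, size=inner_lateral.shape[-2:], mode="nearest")
last_inner = inner_lateral + inner_top_down  # summed, then fed through a 3x3 layer_block
print(last_inner.shape)                      # torch.Size([1, 256, 16, 16])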

Anchor Generation

In torchvision, anchor generation is handled by the AnchorGenerator class, whose source is as follows:

class AnchorGenerator(nn.Module):
    __annotations__ = {
        "cell_anchors": Optional[List[torch.Tensor]],
        "_cache": Dict[str, List[torch.Tensor]]
    }

    """
    Module that generates anchors for a set of feature maps and
    image sizes.

    The module support computing anchors at multiple sizes and aspect ratios
    per feature map. This module assumes aspect ratio = height / width for
    each anchor.

    sizes and aspect_ratios should have the same number of elements, and it should
    correspond to the number of feature maps.

    sizes[i] and aspect_ratios[i] can have an arbitrary number of elements,
    and AnchorGenerator will output a set of sizes[i] * aspect_ratios[i] anchors
    per spatial location for feature map i.

    Arguments:
        sizes (Tuple[Tuple[int]]):
        aspect_ratios (Tuple[Tuple[float]]):
    """

    def __init__(
        self,
        sizes=(128, 256, 512),
        aspect_ratios=(0.5, 1.0, 2.0),
    ):
        super(AnchorGenerator, self).__init__()

        if not isinstance(sizes[0], (list, tuple)):
            # TODO change this
            sizes = tuple((s,) for s in sizes)
        if not isinstance(aspect_ratios[0], (list, tuple)):
            aspect_ratios = (aspect_ratios,) * len(sizes)

        assert len(sizes) == len(aspect_ratios)

        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.cell_anchors = None
        self._cache = {}

    # TODO: https://github.com/pytorch/pytorch/issues/26792
    # For every (aspect_ratios, scales) combination, output a zero-centered anchor with those values.
    # (scales, aspect_ratios) are usually an element of zip(self.scales, self.aspect_ratios)
    # This method assumes aspect ratio = height / width for an anchor.
    def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device="cpu"):
        # type: (List[int], List[float], int, Device) -> Tensor  # noqa: F821
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1 / h_ratios
        # Indexing a tensor with None inserts a new axis at that position: w_ratios[:, None] has
        # shape (r, 1) and scales[None, :] has shape (1, s); tensor.view(-1) flattens to one dimension
        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)
        # Multiplying an (r, 1) column vector by a (1, s) row vector broadcasts both operands to (r, s)
        # base_anchors has shape (r*s, 4)
        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        return base_anchors.round()

    def set_cell_anchors(self, dtype, device):
        # type: (int, Device) -> None  # noqa: F821
        if self.cell_anchors is not None:
            cell_anchors = self.cell_anchors
            assert cell_anchors is not None
            # suppose that all anchors have the same device
            # which is a valid assumption in the current state of the codebase
            if cell_anchors[0].device == device:
                return
        # [base_anchors, sizes, ratios]
        # example: base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        # sizes = (32,), (64,), etc.
        # ratios = (0.5, 1.0, 2.0)
        cell_anchors = [
            self.generate_anchors(
                sizes,
                aspect_ratios,
                dtype,
                device
            )
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
        ]
        self.cell_anchors = cell_anchors

    def num_anchors_per_location(self):
        return [len(s) * len(a) for s, a in zip(self.sizes, self.aspect_ratios)]

    # For every combination of (a, (g, s), i) in (self.cell_anchors, zip(grid_sizes, strides), 0:2),
    # output g[i] anchors that are s[i] distance apart in direction i, with the same dimensions as a.
    def grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
        anchors = []
        cell_anchors = self.cell_anchors
        assert cell_anchors is not None
        # Take a single image to explain the code below: one image, 5 grid_sizes, 5 strides,
        # and 15 cell_anchors (5 preset sizes x 3 aspect ratios) split into 5 groups. Note how
        # self.cell_anchors is built in set_cell_anchors(): each preset size is combined with the
        # three aspect ratios to form one element of the list, and with five preset sizes there are
        # five groups; zip pairs the 5 feature levels with their corresponding anchor areas.

        for size, stride, base_anchors in zip(
            grid_sizes, strides, cell_anchors
        ):
            grid_height, grid_width = size
            stride_height, stride_width = stride
            device = base_anchors.device

            # For output anchor, compute [x_center, y_center, x_center, y_center]
            shifts_x = torch.arange(
                0, grid_width, dtype=torch.float32, device=device
            ) * stride_width
            shifts_y = torch.arange(
                0, grid_height, dtype=torch.float32, device=device
            ) * stride_height
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)
            shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)

            # For every (base anchor, output anchor) pair,
            # offset each zero-centered base anchor by the center of the output anchor.
            anchors.append(
                (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)
            )

        return anchors

    def cached_grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
        key = str(grid_sizes) + str(strides)
        if key in self._cache:
            return self._cache[key]
        anchors = self.grid_anchors(grid_sizes, strides)
        self._cache[key] = anchors
        return anchors

    def forward(self, image_list, feature_maps):
        # type: (ImageList, List[Tensor]) -> List[Tensor]
        grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])
        image_size = image_list.tensors.shape[-2:]
        dtype, device = feature_maps[0].dtype, feature_maps[0].device
        strides = [[torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
                    torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device)] for g in grid_sizes]
        self.set_cell_anchors(dtype, device)
        anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)
        anchors = torch.jit.annotate(List[List[torch.Tensor]], [])
        for i, (image_height, image_width) in enumerate(image_list.image_sizes):
            anchors_in_image = []
            for anchors_per_feature_map in anchors_over_all_feature_maps:
                anchors_in_image.append(anchors_per_feature_map)
            anchors.append(anchors_in_image)
        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
        # Clear the cache in case that memory leaks.
        self._cache.clear()
        return anchors
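Before going line by line, a quick usage sketch shows what the class produces (module paths assumed for torchvision ~0.8; AnchorGenerator has moved between rpn.py and anchor_utils.py across versions):

import torch
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.image_list import ImageList

images = torch.rand(1, 3, 128, 128)
image_list = ImageList(images, image_sizes=[(128, 128)])
# two feature levels, e.g. strides 4 and 8
feature_maps = [torch.rand(1, 256, 32, 32), torch.rand(1, 256, 16, 16)]

gen = AnchorGenerator(sizes=((32,), (64,)), aspect_ratios=((0.5, 1.0, 2.0),) * 2)
anchors = gen(image_list, feature_maps)
print(gen.num_anchors_per_location())  # [3, 3]
print(anchors[0].shape)                # torch.Size([3840, 4]) == (32*32 + 16*16) * 3 anchors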

In the FasterRCNN class, the generation of the prior boxes mainly involves the following lines:

        if rpn_anchor_generator is None:
            anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
            # aspect_ratios = ((0.5, 1.0, 2.0), (0.5, 1.0, 2.0), ..., (0.5, 1.0, 2.0))
            # AnchorGenerator() asserts len(sizes) == len(aspect_ratios)
            aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
            rpn_anchor_generator = AnchorGenerator(
                anchor_sizes, aspect_ratios
            )
  • Start reading from the forward() method of the AnchorGenerator class. forward(self, image_list, feature_maps) takes two inputs used during the forward pass. The first, image_list, holds the input images of the whole model, stored in a helper class named ImageList that keeps both the image tensor and the image sizes. The second, feature_maps, holds the shared feature maps; since a feature pyramid network is used, there are several of them, stored as an OrderedDict. Next the strides are computed as $strides=\left[\frac{image\;size}{feature\;map\;size}\right]$; this is the step between anchor points on the original image.
  • Then self.set_cell_anchors() is called, which in turn calls self.generate_anchors(). self.generate_anchors() builds anchors of different sizes from the preset scales and aspect_ratios, finally returning base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2 of shape (num_scales * num_ratios, 4). self.set_cell_anchors() itself mainly computes cell_anchors = [self.generate_anchors(sizes, aspect_ratios, dtype, device) for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)], i.e. a loop that feeds each size/ratio pair into self.generate_anchors, producing [-ws/2, -hs/2, ws/2, hs/2] for every combination of size and aspect ratio.
  • Next comes self.cached_grid_anchors(), which in turn calls self.grid_anchors(). That method first generates the anchor points [x_center, y_center, x_center, y_center] on the input image and then adds them to cell_anchors, giving $[X_c-\frac{w_s}{2},\,Y_c-\frac{h_s}{2},\,X_c+\frac{w_s}{2},\,Y_c+\frac{h_s}{2}]$: the first two entries are the top-left corner in image coordinates, the last two the bottom-right corner. This yields the position of every anchor, returned as a list. With FPN, each feature level corresponds to a different anchor area; the deepest (lowest-resolution) level, for example, uses anchors of area $512^2$. self.grid_anchors() generates these in a for loop (see the sketch after this list). In addition, self.cached_grid_anchors() maintains self._cache, a dict whose key is key = str(grid_sizes) + str(strides) and whose value is the anchors returned by self.grid_anchors().
  • Finally the anchors are assembled: every image receives the full set of anchors.
  • num_anchors_per_location() returns the number of anchors per anchor point, which is 3 in Faster-RCNN with FPN.
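To make generate_anchors() concrete, here is a hand computation (a sketch using only PyTorch) for a single scale of 128 with the three default aspect ratios; the printed values match what the method returns after round():

import torch

scales = torch.tensor([128.])
ratios = torch.tensor([0.5, 1.0, 2.0])
h_ratios = torch.sqrt(ratios)   # heights scale by sqrt(ratio)
w_ratios = 1 / h_ratios         # widths by 1/sqrt(ratio), so w * h == scale**2
ws = (w_ratios[:, None] * scales[None, :]).view(-1)
hs = (h_ratios[:, None] * scales[None, :]).view(-1)
base_anchors = (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()
print(base_anchors)
# tensor([[-91., -45.,  91.,  45.],
#         [-64., -64.,  64.,  64.],
#         [-45., -91.,  45.,  91.]])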

RPN

In the torchvision source there are two classes related to the RPN: RPNHead and RegionProposalNetwork; the source link is here.

class RPNHead(nn.Module):
    """
    Adds a simple RPN Head with classification and regression heads

    Arguments:
        in_channels (int): number of channels of the input feature
        num_anchors (int): number of anchors to be predicted
    """

    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        # A 3x3 conv with these parameters does not change the feature map's height and width
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        # These are the two RPN branches: one for classification, the other for the four
        # position-related parameters. Since every location on a feature map corresponds to an
        # anchor point on the original image, it suffices to give the following layers suitable
        # channel counts to obtain each anchor's class score and four position parameters
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )

        for layer in self.children():
            torch.nn.init.normal_(layer.weight, std=0.01)
            torch.nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        # type: (List[Tensor]) -> Tuple[List[Tensor], List[Tensor]]
        logits = []
        bbox_reg = []
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg

RPNHead first passes each shared feature map through a $3\times3$ conv that preserves its width and height, followed by two branch networks, both $1\times1$ convs: the classification branch has num_anchors output channels, while the regression branch has 4 * num_anchors. The input to forward() is the set of shared feature maps; since the feature pyramid produces several of them, forward() loops over the levels.
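A short usage sketch (module path assumed: torchvision.models.detection.rpn) with dummy FPN levels and 3 anchors per location:

import torch
from torchvision.models.detection.rpn import RPNHead

head = RPNHead(in_channels=256, num_anchors=3)
features = [torch.rand(1, 256, s, s) for s in (64, 32, 16)]  # three pyramid levels
logits, bbox_reg = head(features)
print(logits[0].shape, bbox_reg[0].shape)
# torch.Size([1, 3, 64, 64]) torch.Size([1, 12, 64, 64])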
