A Code Walkthrough of torchvision's Faster R-CNN

Reference blog posts:

  1. https://zhuanlan.zhihu.com/p/31426458
  2. https://zhuanlan.zhihu.com/p/145842317

Reference implementation: https://github.com/supernotman/Faster-RCNN-with-torchvision

Usage

import torch
from torchvision import transforms
import torchvision
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# For training
images, boxes = torch.rand(4, 3, 600, 1200), torch.rand(4, 11, 4)
# boxes are xyxy, so make sure x1 < x2 and y1 < y2
boxes[:, :, 2:4] = boxes[:, :, 0:2] + boxes[:, :, 2:4]
labels = torch.randint(1, 91, (4, 11))
images = list(image for image in images)
targets = []
for i in range(len(images)):
    d = {}
    d['boxes'] = boxes[i]
    d['labels'] = labels[i]
    targets.append(d)
output = model(images, targets)

'''
output:
    {loss_classifier: *, loss_box_reg: *, loss_objectness: *, loss_rpn_box_reg: *}
'''

# For inference
model.eval()
img1 = transforms.ToTensor()(Image.open('./imgs/000000000036.jpg').convert('RGB'))
img2 = transforms.ToTensor()(Image.open('./imgs/000000000042.jpg').convert('RGB'))
x = [img1, img2]
predictions = model(x)

'''
predictions:
    [{'boxes': Tensor[num_boxes, 4] (xyxy), 'labels': Tensor[num_boxes], 'scores': Tensor[num_boxes]},
     {'boxes': Tensor[num_boxes, 4] (xyxy), 'labels': Tensor[num_boxes], 'scores': Tensor[num_boxes]}]
'''

fasterrcnn_resnet50_fpn creates a FasterRCNN object, using ResNet-50 as the backbone by default. FasterRCNN builds the corresponding submodules (rpn, roi_heads, transform) and then calls its base class GeneralizedRCNN to finish constructing the model.

GeneralizedRCNN main logic

# record the original sizes first so predictions can be mapped back later
original_image_sizes = [img.shape[-2:] for img in images]
images, targets = self.transform(images, targets)
features = self.backbone(images.tensors)
if isinstance(features, torch.Tensor):
    features = OrderedDict([('0', features)])
proposals, proposal_losses = self.rpn(images, features, targets)
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

losses = {}
losses.update(detector_losses)
losses.update(proposal_losses)

FasterRCNN submodule construction

BackboneWithFPN (Backbone)

class BackboneWithFPN(nn.Module):
    """
    Adds a FPN on top of a model.
    Internally, it uses torchvision.models._utils.IntermediateLayerGetter to
    extract a submodel that returns the feature maps specified in return_layers.
    The same limitations of IntermediateLayerGetter apply here.
    Arguments:
        backbone (nn.Module)
        return_layers (Dict[name, new_name]): a dict containing the names
            of the modules for which the activations will be returned as
            the key of the dict, and the value of the dict is the name
            of the returned activation (which the user can specify).
        in_channels_list (List[int]): number of channels for each feature map
            that is returned, in the order they are present in the OrderedDict
        out_channels (int): number of channels in the FPN.
    Attributes:
        out_channels (int): the number of channels in the FPN
    """
    def __init__(self, backbone, return_layers, in_channels_list, out_channels):
        super(BackboneWithFPN, self).__init__()
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
        self.fpn = FeaturePyramidNetwork(
            in_channels_list=in_channels_list,
            out_channels=out_channels,
            extra_blocks=LastLevelMaxPool(),
        )
        self.out_channels = out_channels

    def forward(self, x):
        x = self.body(x)
        x = self.fpn(x)
        return x

Parameter notes:

  • return_layers: which backbone conv layers to return, and the new keys under which they appear in the returned OrderedDict
  • in_channels_list: the channel count of each feature map fed into the FPN
  • out_channels: the channel count of every feature map the FPN outputs, unified to a single number

The ResNet-50 backbone used in the example returns feature maps whose spatial sizes shrink to 1/4, 1/8, 1/16 and 1/32 of the input, with channel counts of 256, 512, 1024 and 2048 respectively. After the FPN, all returned feature maps share a unified channel count of 256, as the sketch below shows.
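A quick way to see this is to build the backbone directly (a sketch; resnet_fpn_backbone's exact signature and the output key names vary slightly across torchvision versions):

import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# build the same ResNet-50 + FPN backbone fasterrcnn_resnet50_fpn uses
backbone = resnet_fpn_backbone('resnet50', pretrained=False)

features = backbone(torch.rand(1, 3, 224, 224))
for name, feat in features.items():
    print(name, feat.shape)
# 0    torch.Size([1, 256, 56, 56])   stride 4
# 1    torch.Size([1, 256, 28, 28])   stride 8
# 2    torch.Size([1, 256, 14, 14])   stride 16
# 3    torch.Size([1, 256, 7, 7])     stride 32
# pool torch.Size([1, 256, 4, 4])     extra LastLevelMaxPool level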

RegionProposalNetwork (RPN)

class RegionProposalNetwork(torch.nn.Module):
    """
    Implements Region Proposal Network (RPN).

    Arguments:
        anchor_generator (AnchorGenerator): module that generates the anchors for a set of feature
            maps.
        head (nn.Module): module that computes the objectness and regression deltas
        fg_iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be
            considered as positive during training of the RPN.
        bg_iou_thresh (float): maximum IoU between the anchor and the GT box so that they can be
            considered as negative during training of the RPN.
        batch_size_per_image (int): number of anchors that are sampled during training of the RPN
            for computing the loss
        positive_fraction (float): proportion of positive anchors in a mini-batch during training
            of the RPN
        pre_nms_top_n (Dict[int]): number of proposals to keep before applying NMS. It should
            contain two fields: training and testing, to allow for different values depending
            on training or evaluation
        post_nms_top_n (Dict[int]): number of proposals to keep after applying NMS. It should
            contain two fields: training and testing, to allow for different values depending
            on training or evaluation
        nms_thresh (float): NMS threshold used for postprocessing the RPN proposals

    """
        # RPN uses all feature maps that are available
        features = list(features.values())
        objectness, pred_bbox_deltas = self.head(features)
        anchors = self.anchor_generator(images, features)

        num_images = len(anchors)
        num_anchors_per_level_shape_tensors = [o[0].shape for o in objectness]
        num_anchors_per_level = [s[0] * s[1] * s[2] for s in num_anchors_per_level_shape_tensors]
        objectness, pred_bbox_deltas = \
            concat_box_prediction_layers(objectness, pred_bbox_deltas)
        # apply pred_bbox_deltas to anchors to obtain the decoded proposals
        # note that we detach the deltas because Faster R-CNN does not backprop through
        # the proposals
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)

        losses = {}
        if self.training:
            assert targets is not None
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
            loss_objectness, loss_rpn_box_reg = self.compute_loss(
                objectness, pred_bbox_deltas, labels, regression_targets)
            losses = {
                "loss_objectness": loss_objectness,
                "loss_rpn_box_reg": loss_rpn_box_reg,
            }
        return boxes, losses

Processing flow:

1. Regress proposals and their foreground probabilities: head

On each feature map produced by the backbone, a Conv2d head regresses, for every anchor at every spatial location, an objectness score (foreground probability) and the deltas between the predicted box and the anchor. The spatial size of the outputs is unchanged; only the channel count becomes num_anchors for the scores and 4 * num_anchors for the deltas.
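A minimal sketch of this head (it mirrors torchvision.models.detection.rpn.RPNHead; the class name here is ours):

import torch
from torch import nn
import torch.nn.functional as F

class RPNHeadSketch(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
        # one objectness logit per anchor per location
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        # four box deltas per anchor per location
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, features):  # features: List[Tensor], one per FPN level
        logits, bbox_reg = [], []
        for feat in features:
            t = F.relu(self.conv(feat))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg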

2. Generate anchors: anchor_generator

Anchors are generated from the anchor sizes and aspect ratios passed in: every feature map is tiled with anchors at each spatial location, scaled according to that map's stride relative to the input image. The default configuration is sketched below.
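These are the defaults used by fasterrcnn_resnet50_fpn (AnchorGenerator is importable from torchvision.models.detection.rpn; in newer releases it also lives in torchvision.models.detection.anchor_utils):

from torchvision.models.detection.rpn import AnchorGenerator

# one anchor size per FPN level, three aspect ratios per size,
# so 3 anchors per spatial location on every level
anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)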

3. Combine the anchors and the regressed anchor deltas into proposals

The decode method of det_utils.BoxCoder applies the predicted deltas to the anchors to obtain the proposals.
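For instance (a sketch; det_utils here is torchvision's internal torchvision.models.detection._utils module):

import torch
from torchvision.models.detection import _utils as det_utils

# the RPN constructs its coder with weights=(1.0, 1.0, 1.0, 1.0)
box_coder = det_utils.BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))

anchors = [torch.tensor([[0., 0., 100., 100.]])]   # xyxy
deltas = torch.tensor([[0.1, 0.1, 0.2, 0.2]])      # (t_x, t_y, t_w, t_h)
proposals = box_coder.decode(deltas, anchors)      # decoded xyxy boxes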

4. Filter the proposals: pre_nms_top_n, nms_thresh, post_nms_top_n

  • per feature level, keep the pre_nms_top_n anchors with the highest foreground scores
  • remove proposals that are too small after clipping to the visible image area
  • suppress overlapping proposals with NMS at nms_thresh
  • keep the post_nms_top_n best proposals overall (see the sketch after this list)
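A simplified per-image version of this filtering (the real filter_proposals also performs the per-level top-k selection first; the function name here is ours):

import torch
from torchvision.ops import clip_boxes_to_image, remove_small_boxes, nms

def filter_proposals_sketch(boxes, scores, image_size,
                            nms_thresh=0.7, post_nms_top_n=1000, min_size=1e-3):
    boxes = clip_boxes_to_image(boxes, image_size)   # clip to the image
    keep = remove_small_boxes(boxes, min_size)       # drop tiny boxes
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, nms_thresh)            # suppress overlaps
    keep = keep[:post_nms_top_n]                     # global top-k
    return boxes[keep], scores[keep]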

5. (training) Encode anchor offsets: fg_iou_thresh, bg_iou_thresh

  • match anchors to ground-truth boxes with det_utils.Matcher (see the example after this list)
  • the encode method of det_utils.BoxCoder yields each anchor's offsets against its matched ground truth
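Roughly (the thresholds shown are the RPN defaults; gt_boxes and anchors are made-up inputs):

import torch
from torchvision.ops import box_iou
from torchvision.models.detection import _utils as det_utils

matcher = det_utils.Matcher(high_threshold=0.7, low_threshold=0.3,
                            allow_low_quality_matches=True)

gt_boxes = torch.tensor([[10., 10., 50., 50.]])
anchors = torch.tensor([[0., 0., 40., 40.], [100., 100., 140., 140.]])

iou = box_iou(gt_boxes, anchors)   # [num_gt, num_anchors]
matched_idxs = matcher(iou)        # index of matched GT; -1 = background, -2 = ignored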

6. (training) Compute the losses: batch_size_per_image, positive_fraction

  • sample a balanced set of foreground and background anchors: BalancedPositiveNegativeSampler (see the sketch after this list)

  • the box regression loss is a smooth L1 loss; the objectness loss is a binary cross-entropy loss
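The sampler with its RPN defaults (in its input, 1 marks a foreground anchor, 0 background, -1 ignored):

import torch
from torchvision.models.detection import _utils as det_utils

sampler = det_utils.BalancedPositiveNegativeSampler(
    batch_size_per_image=256, positive_fraction=0.5)

labels = [torch.tensor([1., 0., 0., -1., 1., 0.])]   # one tensor per image
pos_masks, neg_masks = sampler(labels)               # binary masks of sampled anchors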

RoIHeads

class RoIHeads(torch.nn.Module):
    def __init__(self,
                 box_roi_pool,
                 box_head,
                 box_predictor,
                 # Faster R-CNN training
                 fg_iou_thresh, bg_iou_thresh,
                 batch_size_per_image, positive_fraction,
                 bbox_reg_weights,
                 # Faster R-CNN inference
                 score_thresh,
                 nms_thresh,
                 detections_per_img,
                 ):
        ...

    def forward(self, features, proposals, image_shapes, targets=None):
        # (excerpt from forward)
        if self.training:
            proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
            
        else:
            labels = None
            regression_targets = None
            matched_idxs = None

        box_features = self.box_roi_pool(features, proposals, image_shapes)
        box_features = self.box_head(box_features)
        class_logits, box_regression = self.box_predictor(box_features)

        result = torch.jit.annotate(List[Dict[str, torch.Tensor]], [])
        losses = {}
        if self.training:
            assert labels is not None and regression_targets is not None
            loss_classifier, loss_box_reg = fastrcnn_loss(
                class_logits, box_regression, labels, regression_targets)
            losses = {
                "loss_classifier": loss_classifier,
                "loss_box_reg": loss_box_reg
            }
        else:
            boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
            num_images = len(boxes)
            for i in range(num_images):
                result.append(
                    {
                        "boxes": boxes[i],
                        "labels": labels[i],
                        "scores": scores[i],
                    }
                )

Processing flow:

1. (training) Select training samples: fg_iou_thresh, bg_iou_thresh, batch_size_per_image, positive_fraction

  • match proposals to ground-truth boxes
  • sample a balanced mix of foreground and background proposals, as in the RPN
  • encode the offsets of the sampled proposals against their matched ground truth

2. RoI align: box_roi_pool

MultiScaleRoIAlign crops each proposal out of the appropriate feature-map level and resizes it to a fixed spatial size.
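The default used by fasterrcnn_resnet50_fpn (in very old releases featmap_names were ints rather than strings):

from torchvision.ops import MultiScaleRoIAlign

box_roi_pool = MultiScaleRoIAlign(
    featmap_names=['0', '1', '2', '3'],  # which FPN levels to pool from
    output_size=7,                       # every proposal becomes a 7x7 feature
    sampling_ratio=2)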

3. Feature extraction: box_head

The pooled, size-normalized features are flattened and passed through a two-layer MLP.
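The default box_head is TwoMLPHead; with the FPN's 256 output channels and the 7x7 RoI output above, it is built as:

from torchvision.models.detection.faster_rcnn import TwoMLPHead

box_head = TwoMLPHead(in_channels=256 * 7 * 7, representation_size=1024)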

4. Regress per-class scores and box offsets: box_predictor

For every proposal, predict a probability for each class and a box offset for each class.
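The default box_predictor is FastRCNNPredictor, a pair of linear layers on top of the 1024-d box_head features (91 classes for COCO, with background as class 0):

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

box_predictor = FastRCNNPredictor(in_channels=1024, num_classes=91)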

5. (training) Compute the losses

  • classification loss: cross-entropy
  • box offset loss: smooth L1, computed on foreground proposals only (see the sketch after this list)
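A simplified sketch of fastrcnn_loss (the function name and exact reduction details are paraphrased from torchvision's implementation):

import torch
import torch.nn.functional as F

def fastrcnn_loss_sketch(class_logits, box_regression, labels, regression_targets):
    loss_classifier = F.cross_entropy(class_logits, labels)
    # regression loss only on foreground proposals (label > 0),
    # taken from the box column of the matched class
    pos = torch.where(labels > 0)[0]
    N = class_logits.shape[0]
    box_regression = box_regression.reshape(N, -1, 4)
    loss_box_reg = F.smooth_l1_loss(
        box_regression[pos, labels[pos]],
        regression_targets[pos],
        reduction='sum') / labels.numel()
    return loss_classifier, loss_box_reg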

6. Decode the offsets and post-process the boxes: score_thresh, nms_thresh, detections_per_img

  • decode the offsets into predicted boxes
  • softmax the logits into class scores
  • drop predictions whose score is below score_thresh
  • drop predictions that are too small after clipping to the image
  • run per-class NMS at nms_thresh
  • keep the detections_per_img highest-scoring predictions (see the sketch after this list)
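A simplified per-image version (the real postprocess_detections also expands the boxes per class and drops the background column first; the function name here is ours):

import torch
from torchvision.ops import batched_nms, clip_boxes_to_image, remove_small_boxes

def postprocess_sketch(boxes, scores, labels, image_size,
                       score_thresh=0.05, nms_thresh=0.5, detections_per_img=100):
    boxes = clip_boxes_to_image(boxes, image_size)
    keep = torch.where(scores > score_thresh)[0]           # score filtering
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = remove_small_boxes(boxes, min_size=1e-2)        # drop tiny boxes
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = batched_nms(boxes, scores, labels, nms_thresh)  # per-class NMS
    keep = keep[:detections_per_img]                       # top-k
    return boxes[keep], scores[keep], labels[keep]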

GeneralizedRCNNTransform (transform)

class GeneralizedRCNNTransform(nn.Module):
    """
    Performs input / target transformation before feeding the data to a GeneralizedRCNN
    model.

    The transformations it performs are:
        - input normalization (mean subtraction and std division)
        - input / target resizing to match min_size / max_size

    It returns an ImageList for the inputs, and a List[Dict[Tensor]] for the targets
    """
  • Before feeding the network: normalize the images (subtract mean, divide by std) and resize them (defaults shown below).
  • After the network: resize the predicted boxes back to the original image size.
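The defaults used by fasterrcnn_resnet50_fpn (ImageNet mean/std):

from torchvision.models.detection.transform import GeneralizedRCNNTransform

transform = GeneralizedRCNNTransform(
    min_size=800, max_size=1333,        # resize bounds
    image_mean=[0.485, 0.456, 0.406],
    image_std=[0.229, 0.224, 0.225])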

Miscellaneous

Box encoding

Here x, y, w, h are the box center coordinates and the box width and height, and x, x_a, x^* denote the predicted box, the anchor box, and the ground-truth box respectively (likewise for y, w, h). The encoded offsets are:

t_x = (x - x_a) / w_a        t_y = (y - y_a) / h_a
t_w = log(w / w_a)           t_h = log(h / h_a)

t_x^* = (x^* - x_a) / w_a    t_y^* = (y^* - y_a) / h_a
t_w^* = log(w^* / w_a)       t_h^* = log(h^* / h_a)

Hence:

  • t_x: the horizontal offset of the center, normalized by the anchor width
  • t_y: the vertical offset of the center, normalized by the anchor height
  • t_w: the log of the width ratio
  • t_h: the log of the height ratio
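In torchvision this encoding is implemented by det_utils.BoxCoder (the offsets are additionally multiplied by the coder's weights; the boxes below are made-up inputs):

import torch
from torchvision.models.detection import _utils as det_utils

box_coder = det_utils.BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))

# encode expects xyxy boxes and converts to center/size internally
gt = [torch.tensor([[10., 10., 50., 50.]])]
anchors = [torch.tensor([[0., 0., 40., 40.]])]
offsets = box_coder.encode(gt, anchors)   # the (t_x, t_y, t_w, t_h) above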
