Faster-RCNN

Paper-info

  • title : Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [NIPS 2015]
  • author : Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun.
  • Github : fasterrcnn-pytorch


Motivation

Before introducing Faster R-CNN, let's quickly review the main stages of R-CNN and Fast R-CNN:

# R-CNN
ROIs = region_proposal(image)     # Selective Search etc.
for ROI in ROIs:
    patch = get_patch(image, ROI) # crop each region and run the CNN on it separately
    results = detector(patch)

# Fast R-CNN
feature_maps = process(image)     # extract features for the whole image only once!
ROIs = region_proposal(image)     # time consuming !!! [Selective Search etc.]
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

The network architectures of R-CNN and Fast R-CNN are shown below:

  • R-CNN

  • Fast R-CNN

The test-time speed of Fast R-CNN is dominated by the region-proposal method (e.g. selective search), and these methods run on the CPU and are slow: the whole test pipeline takes about 2.3 s per image, of which generating the ~2000 ROIs alone takes roughly 2 s! From the review above we know that Fast R-CNN folds the multiple stages of R-CNN into a single network and thereby enables end-to-end training, but its inference remains slow (about 0.5 FPS, mainly because of the proposal-generation step), which greatly limits its practical applications. To address this weakness, Faster R-CNN introduces the anchor-based RPN (Region Proposal Network), which solves Fast R-CNN's pain point elegantly and efficiently; it is the culmination of the R-CNN family and remains one of the most widely used detection models.

Idea

First, let's look roughly at how selective search works [2]: in essence it performs a bottom-up hierarchical grouping of image regions, merging small regions into larger ones, and outputs the bounding boxes of all regions produced during this grouping as proposals (see [2] for an illustration of the grouping process).
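
For intuition, here is a minimal sketch of generating proposals with OpenCV's selective search implementation; it assumes the opencv-contrib-python package is installed, and the image path is a placeholder:

# A minimal sketch of selective search proposals via OpenCV contrib.
import cv2

img = cv2.imread("demo.jpg")                               # placeholder test image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()                           # trade quality for speed
rects = ss.process()                                       # (x, y, w, h) boxes, usually thousands
print(len(rects), "proposals")                             # ~2000 are typically kept for Fast R-CNN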

To speed up proposal generation, Faster R-CNN introduces the RPN (a shallow network built into the detector), which uses the anchor mechanism to produce diverse proposals. The concrete procedure is:

step - 1. For every pixel of the feature map, generate k diverse base anchors (controlled by [size] x [aspect_ratio]; the base anchors are identical at every pixel), then use the stride between the feature map and the input image to map the base anchors back to coordinates in the original image.
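
As a concrete sketch (separate from the torchvision code analysed later), the snippet below builds the k = 3 sizes x 3 aspect ratios = 9 base anchors and shifts them over a toy 38x50 feature map with stride 16:

# A minimal anchor-generation sketch; sizes/ratios follow the paper's defaults.
import torch

sizes  = torch.tensor([128., 256., 512.])      # anchor scales, in input-image pixels
ratios = torch.tensor([0.5, 1.0, 2.0])         # aspect ratios h/w
h_ratios = torch.sqrt(ratios)
w_ratios = 1.0 / h_ratios

ws = (w_ratios[:, None] * sizes[None, :]).reshape(-1)        # 9 widths
hs = (h_ratios[:, None] * sizes[None, :]).reshape(-1)        # 9 heights
base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2    # (x1, y1, x2, y2), centred at (0, 0)

fh, fw, stride = 38, 50, 16                    # toy feature-map size and image stride
shift_x = (torch.arange(fw) * stride).repeat(fh)             # x varies fastest
shift_y = (torch.arange(fh) * stride).repeat_interleave(fw)
shifts  = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1).float()

anchors = (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)
print(anchors.shape)                           # torch.Size([17100, 4]) = 38 * 50 * 9 anchors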

step - 2. The RPN contains two sibling branches that output, for every anchor, an objectness score and a bbox offset.

              [Why do k anchors produce 2k scores here? In the paper, each anchor's objectness is modeled as a two-class softmax (object vs. background), hence 2k scores; predicting a single sigmoid score per anchor, i.e. k scores as torchvision does, is equivalent.]
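
To make the 2k / 4k outputs concrete, here is a minimal sketch of a paper-style RPN head (a 3x3 conv followed by two sibling 1x1 convs); the 256 channels and k = 9 follow the paper's setting, while torchvision's RPNHead (shown later) predicts k sigmoid scores instead:

# Paper-style RPN head: 2k softmax objectness scores + 4k box offsets per location.
import torch
import torch.nn as nn

class PaperRPNHead(nn.Module):
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)   # object / background per anchor
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)   # (dx, dy, dw, dh) per anchor

    def forward(self, feat):
        t = torch.relu(self.conv(feat))
        return self.cls(t), self.reg(t)

feat = torch.randn(1, 256, 38, 50)                    # toy backbone feature map
scores, deltas = PaperRPNHead()(feat)
print(scores.shape, deltas.shape)                     # [1, 18, 38, 50] and [1, 36, 38, 50]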

step - 3. Refine the anchors with the predicted offsets. Note that the offset is parameterized as

        t_x = (x - x_a)/w_a, \; t_y = (y - y_a)/h_a, \; t_w = \log(w/w_a), \; t_h = \log(h/h_a)

where (x, y, w, h) are the center coordinates and size of the predicted box and (x_a, y_a, w_a, h_a) those of the anchor; the decode code below inverts exactly this mapping.

(The full Faster R-CNN loss function is written out in the Loss function section below.)

def decode_single(self, rel_codes, boxes):
    '''
    From a set of original boxes and encoded relative box offsets,
    get the decoded boxes.  (torchvision: BoxCoder.decode_single)

    Arguments:
        rel_codes (Tensor): bbox offsets predicted by the RPN head
        boxes (Tensor)    : anchor boxes in (x1, y1, x2, y2) format
    '''

    boxes = boxes.to(rel_codes.dtype)

    widths  = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    wx, wy, ww, wh = self.weights  # defaults to (1.0, 1.0, 1.0, 1.0) for the RPN
    dx = rel_codes[:, 0::4] / wx
    dy = rel_codes[:, 1::4] / wy
    dw = rel_codes[:, 2::4] / ww
    dh = rel_codes[:, 3::4] / wh

    # Prevent sending too large values into torch.exp()
    dw = torch.clamp(dw, max=self.bbox_xform_clip)
    dh = torch.clamp(dh, max=self.bbox_xform_clip)

    pred_ctr_x = dx * widths[:, None]  + ctr_x[:, None]
    pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
    pred_w = torch.exp(dw) * widths[:, None]
    pred_h = torch.exp(dh) * heights[:, None]

    pred_boxes = torch.zeros_like(rel_codes)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h

    return pred_boxes
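
To see the decoding in action, here is a small hedged example using torchvision's internal BoxCoder (it lives in torchvision.models.detection._utils; being a private module, its location and defaults may change between versions):

import math
import torch
from torchvision.models.detection._utils import BoxCoder   # internal API, may move across versions

coder = BoxCoder(weights=(1.0, 1.0, 1.0, 1.0), bbox_xform_clip=math.log(1000.0 / 16))
anchors = torch.tensor([[0.0, 0.0, 16.0, 16.0],
                        [8.0, 8.0, 40.0, 40.0]])
offsets = torch.tensor([[0.1, -0.2, 0.0, 0.3],
                        [0.0,  0.0, 0.0, 0.0]])             # zero offset reproduces the anchor
decoded = coder.decode_single(offsets, anchors)
print(decoded)                                              # second row equals the second anchor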

step - 4. Filter the anchors with the objectness scores plus NMS; whatever survives is the set of proposals output by the RPN.
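
A minimal sketch of this filtering step with torchvision.ops; the top-k sizes and the IoU threshold below are illustrative placeholders (the real logic, including per-level top-k selection, lives in torchvision's RegionProposalNetwork.filter_proposals):

import torch
from torchvision.ops import nms, clip_boxes_to_image

def filter_proposals_sketch(proposals, objectness, image_size,
                            pre_nms_top_n=2000, post_nms_top_n=300, nms_thresh=0.7):
    scores = objectness.sigmoid()
    top = scores.topk(min(pre_nms_top_n, scores.numel())).indices   # keep highest-scoring boxes
    boxes, scores = proposals[top], scores[top]
    boxes = clip_boxes_to_image(boxes, image_size)                  # clip to the image borders
    keep = nms(boxes, scores, nms_thresh)[:post_nms_top_n]          # suppress overlapping boxes
    return boxes[keep], scores[keep]

xy = torch.rand(5000, 2) * 500                 # dummy proposals: top-left corners ...
wh = torch.rand(5000, 2) * 100 + 1             # ... plus widths / heights
boxes, scores = filter_proposals_sketch(torch.cat([xy, xy + wh], dim=1),
                                        torch.randn(5000),          # dummy objectness logits
                                        image_size=(600, 800))
print(boxes.shape, scores.shape)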

Loss function

        L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

where:

p_i is the predicted probability that anchor i is an object;

p_i^* is the ground-truth binary label (1: positive anchor, 0: negative anchor);

t_i is the predicted-box parameterization {t_x, t_y, t_w, t_h};

t_i^* is the corresponding ground-truth-box parameterization {t_x^*, t_y^*, t_w^*, t_h^*}, with

        t_x = (x - x_a)/w_a, \; t_y = (y - y_a)/h_a, \; t_w = \log(w/w_a), \; t_h = \log(h/h_a)
        t_x^* = (x^* - x_a)/w_a, \; t_y^* = (y^* - y_a)/h_a, \; t_w^* = \log(w^*/w_a), \; t_h^* = \log(h^*/h_a)

where x, x_a, x^* refer to the predicted box, the anchor box, and the ground-truth box respectively (likewise for y, w, h); N_cls and N_reg are normalizing constants and \lambda balances the two terms;

L_cls(p_i, p_i^*) is the classification loss, a log loss over the two classes (object vs. not object);

L_reg(t_i, t_i^*) is the regression loss, using the smooth L1 loss defined in Fast R-CNN:

        \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}
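
A minimal sketch of these two loss terms in PyTorch, assuming anchors have already been matched and sampled (labels: 1 = positive, 0 = negative, -1 = ignored) and regression targets encoded as above; torchvision's RegionProposalNetwork.compute_loss does essentially this with binary cross-entropy and a smooth L1 variant:

import torch
import torch.nn.functional as F

def rpn_loss_sketch(objectness, pred_deltas, labels, reg_targets):
    """objectness: [N] logits, pred_deltas / reg_targets: [N, 4], labels: [N] in {1, 0, -1}."""
    sampled = torch.where(labels >= 0)[0]          # drop ignored anchors
    pos = torch.where(labels == 1)[0]              # regression is only computed on positives

    loss_cls = F.binary_cross_entropy_with_logits(
        objectness[sampled], labels[sampled].float())
    loss_reg = F.smooth_l1_loss(
        pred_deltas[pos], reg_targets[pos], reduction="sum") / max(sampled.numel(), 1)
    return loss_cls, loss_reg

# toy check
labels = torch.tensor([1, 0, -1, 1])
loss_cls, loss_reg = rpn_loss_sketch(torch.randn(4), torch.randn(4, 4),
                                     labels, torch.randn(4, 4))
print(loss_cls.item(), loss_reg.item())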

 

Outline

feature_maps = process(image)
ROIs         = RPN(feature_maps)  # Faster! proposals are now computed from the shared feature maps
for ROI in ROIs:
    patch   = roi_pooling(feature_maps, ROI)
    results = detector2(patch)
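
Before diving into the source code, here is a short usage sketch of torchvision's ready-made implementation (the pretrained/weights keyword differs across torchvision versions):

import torch
import torchvision

# Pretrained Faster R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

images = [torch.rand(3, 600, 800)]          # a dummy image with values in [0, 1]
with torch.no_grad():
    outputs = model(images)                 # list of dicts: boxes, labels, scores
print(outputs[0]["boxes"].shape, outputs[0]["scores"][:5])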

torchvision source code analysis

Anchor generation

class AnchorGenerator(nn.Module):
    """
    Module that generates anchors for a set of feature maps and image sizes.

    The module support computing anchors at multiple sizes and aspect ratios per feature map.

    sizes and aspect_ratios should have the same number of elements, and it should
    correspond to the number of feature maps.

    sizes[i] and aspect_ratios[i] can have an arbitrary number of elements,
    and AnchorGenerator will output a set of sizes[i] * aspect_ratios[i] anchors
    per spatial location for feature map i.

    Arguments:
        sizes (Tuple[Tuple[int]]):
        aspect_ratios (Tuple[Tuple[float]]):
    """

    def __init__(self, sizes=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):

        super(AnchorGenerator, self).__init__()

        ...

        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.cell_anchors = None
        self._cache = {}

    @staticmethod
    def generate_anchors(scales, aspect_ratios, device="cpu"):
        '''
        Generate the base_anchor
        '''
        scales = torch.as_tensor(scales, dtype=torch.float32, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=torch.float32, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1 / h_ratios

        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)

        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2  # (x1, y1, x2, y2) centred at the origin
        return base_anchors.round()


    def set_cell_anchors(self, device):
        ''' Generate the anchors '''

        if self.cell_anchors is not None:
            return self.cell_anchors

        cell_anchors = [
            self.generate_anchors(sizes, aspect_ratios, device)
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
        ]
        self.cell_anchors = cell_anchors


    def num_anchors_per_location(self):
        return [len(s) * len(a) for s, a in zip(self.sizes, self.aspect_ratios)]


    def grid_anchors(self, grid_sizes, strides):
        '''
        Generate the anchors according to base_anchor and stride
        grid_size : size of feature_map
        stride    : map_stride between img and feature_map
        '''
        anchors = list()
        for size, stride, base_anchors in zip(grid_sizes, strides, self.cell_anchors):

            grid_height, grid_width = size
            stride_height, stride_width = stride
            device = base_anchors.device
            shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
            shifts_y = torch.arange(0, grid_height,dtype=torch.float32, device=device) * stride_height
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)
            shifts  = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)

            anchors.append((shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4))

        return anchors


    def cached_grid_anchors(self, grid_sizes, strides):

        key = tuple(grid_sizes) + tuple(strides)
        if key in self._cache:
            return self._cache[key]
        anchors = self.grid_anchors(grid_sizes, strides)   # compute once per key, then cache
        self._cache[key] = anchors
        return anchors


    def forward(self, image_list, feature_maps):
        '''
        step - 1. generate the base_anchor according to scales and aspect_ratios
        step - 2. generate the anchors over all feature maps for each type of image_size
        NOTE : there may be multi-scales feature_map
        '''

        # step - 1
        self.set_cell_anchors(feature_maps[0].device)

        # step - 2
        grid_sizes = tuple([feature_map.shape[-2:] for feature_map in feature_maps])
        image_size = image_list.tensors.shape[-2:]
        # calculate the map-stride between img_size and feature_map
        strides    = tuple((image_size[0] / g[0], image_size[1] / g[1]) for g in grid_sizes)
        anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)

        anchors = []
        # deal with each instance in batch
        for i, (image_height, image_width) in enumerate(image_list.image_sizes):
            anchors_in_image = []
            for anchors_per_feature_map in anchors_over_all_feature_maps:
                anchors_in_image.append(anchors_per_feature_map)
            anchors.append(anchors_in_image)
        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
        return anchors
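
A common way to exercise AnchorGenerator is to plug a custom instance into torchvision's FasterRCNN, as in the official detection tutorial; note that the import path below matches older releases (newer ones expose it from torchvision.models.detection.anchor_utils):

import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator   # anchor_utils in newer releases

# one tuple of sizes / aspect ratios per feature map (a single map from this backbone)
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

backbone = torchvision.models.mobilenet_v2(pretrained=True).features
backbone.out_channels = 1280                    # FasterRCNN reads this attribute

roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                output_size=7,
                                                sampling_ratio=2)
model = FasterRCNN(backbone, num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)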

RPN head

class RPNHead(nn.Module):
    """
    Adds a simple RPN Head with classification and regression heads

    Arguments:
        in_channels (int): number of channels of the input feature
        num_anchors (int): number of anchors to be predicted
    """

    def __init__(self, in_channels, num_anchors):

        super(RPNHead, self).__init__()

        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred  = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)
        
        # initialize the conv layers with a small Gaussian and zero bias
        for l in self.children():
            torch.nn.init.normal_(l.weight, std=0.01)
            torch.nn.init.constant_(l.bias, 0)

    def forward(self, x):

        logits, bbox_reg = [], []
        for feature in x:
            t = F.relu(self.conv(feature))      # shared 3x3 conv: same spatial size and channel count
            logits.append(self.cls_logits(t))   # objectness score (k values per location)
            bbox_reg.append(self.bbox_pred(t))  # anchor box offsets (4k values per location)
        return logits, bbox_reg
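
A quick sanity check of RPNHead on two toy FPN levels, assuming torch.nn as nn and torch.nn.functional as F have been imported for the class definition above (256 channels and 3 anchors per location mirror torchvision's FPN defaults):

import torch

head = RPNHead(in_channels=256, num_anchors=3)
features = [torch.rand(1, 256, 50, 64), torch.rand(1, 256, 25, 32)]   # two toy FPN levels
logits, bbox_reg = head(features)
print([t.shape for t in logits])      # [1, 3, 50, 64] and [1, 3, 25, 32]
print([t.shape for t in bbox_reg])    # [1, 12, 50, 64] and [1, 12, 25, 32]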

Region Proposal Network

class RegionProposalNetwork(torch.nn.Module):
    
    ...

    def forward(self, images, features, targets=None):
        """
        Arguments:
            images (ImageList): images for which we want to compute the predictions
            features (List[Tensor]): features computed from the images that are
                used for computing the predictions. Each tensor in the list
                correspond to different feature levels
            targets (List[Dict[Tensor]): ground-truth boxes present in the image (optional).
                If provided, each element in the dict should contain a field `boxes`,
                with the locations of the ground-truth boxes.

        Returns:
            boxes (List[Tensor]): the predicted boxes from the RPN, one Tensor per
                image.
            losses (Dict[Tensor]): the losses for the model during training. During
                testing, it is an empty dict.
        """
        # feature maps extracted by the backbone
        features = list(features.values())
        
        # generate anchors on top of these feature maps
        anchors = self.anchor_generator(images, features)
        
        # RPNHead predicts the objectness and bbox offset for every anchor
        objectness, pred_bbox_deltas = self.head(features)
        
        # the feature maps have gone through the FPN, so there are several scales; concatenate the predictions
        num_images = len(anchors)
        num_anchors_per_level = [obj[0].numel() for obj in objectness]
        objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness, pred_bbox_deltas)
        
        # refine the anchors with the predicted offsets to get the initial proposals
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        
        # filter out unreliable boxes using the objectness scores and NMS
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, \
                                                  num_anchors_per_level)
        losses = {}
        if self.training:
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
            loss_objectness, loss_rpn_box_reg = self.compute_loss(objectness, pred_bbox_deltas, \
                                                    labels, regression_targets)
            losses = {
                "loss_objectness": loss_objectness,
                "loss_rpn_box_reg": loss_rpn_box_reg,
            }
        return boxes, losses
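
To see how these pieces are wired together, one can inspect the rpn attribute of a ready-made model; the attribute names below follow recent torchvision releases and are best treated as an assumption:

import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
rpn = model.rpn                              # a RegionProposalNetwork instance
print(type(rpn).__name__)                    # RegionProposalNetwork
print(rpn.head)                              # the RPNHead (conv + cls_logits + bbox_pred)
print(rpn.anchor_generator.sizes)            # per-level anchor sizes used by the FPN model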

Experiment


Conclusion

  • The RPN classification scores account for the accuracy of the highest-ranked proposals;
  • high-quality proposals are mainly due to the RPN-regressed box positions; anchor boxes alone are not sufficient for accurate detection.

Reference

[1]. S. Ren, K. He, R. Girshick, J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

[2]. J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders. Selective Search for Object Recognition. IJCV 2013.

[3]. https://medium.com/@jonathan_hui/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9
