Faster R-CNN and RPN

图波列夫

Last modified 2023-08-12 10:02:32

First published 2018-08-01 14:00:37

Article link: https://blog.csdn.net/yiran103/article/details/80997760

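Fast R-CNN shares one feature map across all candidate boxes, which greatly improves training and deployment efficiency. However, its input still depends on external proposal methods such as Selective Search, which account for a large share of the system's running time and leave little room for optimization.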

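Faster R-CNN generates candidate regions with a Region Proposal Network (RPN). The RPN shares feature maps with the second-stage Fast R-CNN, which raises efficiency once more.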

The overall framework of Faster R-CNN is shown in the figure below.
[Figure: architecture]

The RPN tells the detector where to look, which amounts to adding an attention mechanism to the detector.

The RPN takes the convolutional feature map and generates proposals over the image

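If the second stage of Faster R-CNN is viewed as Fast R-CNN sliding a window over the feature map, then YOLO instead connects the whole image directly through fully connected layers, SSD resembles a multi-class RPN over multiple feature maps, and FPN operates on fused pyramid features.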

RPN

The RPN network structure is shown in the figure below:

Convolutional implementation of an RPN architecture, where k is the number of anchors.

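The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position, producing high-quality region proposals that Fast R-CNN uses for detection. The RPN can be trained jointly with Fast R-CNN for end-to-end optimization.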

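The convolutional feature maps used by region-based detectors such as Fast R-CNN can also be used to generate region proposals. On top of these convolutional features, an RPN is built by adding a few extra convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN takes an image of any size as input and outputs a set of rectangular object proposals, each with an objectness score. The Faster R-CNN experiments study two backbone networks: ZFNet (5 shareable convolutional layers) and VGG-16 (13 shareable convolutional layers).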

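The RPN is designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods that use pyramids of images or pyramids of filters (DPM, OverFeat, SPPNet, Fast R-CNN), it introduces "anchor" boxes that serve as references at multiple scales and aspect ratios. The RPN can be viewed as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios. The model is trained and tested on single-scale images, performs well, and benefits running speed.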

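To generate region proposals, the paper slides a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes an $n \times n$ spatial window of the feature map as input. Each sliding window is mapped to a lower-dimensional feature (256-d for ZFNet and 512-d for VGG-16, followed by a ReLU). This feature is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls). The paper uses $n = 3$, noting that the effective receptive field on the input image is large (171 pixels for ZFNet and 228 pixels for VGG-16). The mini-network at a single position is shown in the figure below.
[Figure: rpn]
Note that because the mini-network operates in a sliding-window fashion, the fully connected layers are shared across all spatial locations. This architecture is naturally implemented as an $n \times n$ convolutional layer followed by two sibling $1 \times 1$ convolutional layers (for reg and cls respectively). ReLU is applied to the output of the $n \times n$ conv layer.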

Anchors

Anchor centers throughout the original image

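At each sliding-window location, the RPN predicts multiple region proposals simultaneously. If the maximum number of proposals per location is $k$, the reg layer has $4k$ outputs encoding the coordinates of $k$ boxes, and the cls layer outputs $2k$ scores that estimate the probability of object or not object for each proposal. The $k$ proposals are parameterized relative to $k$ reference boxes, called anchors. An anchor is centered at the sliding window in question and is associated with a scale and an aspect ratio. By default, Faster R-CNN uses 3 scales and 3 aspect ratios, yielding $k = 9$ anchors at each sliding position. For a convolutional feature map of size $W \times H$ (typically $\sim$2,400), there are $WHk$ anchors in total.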
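
As a rough illustration of the anchor count, the sketch below builds the $k = 9$ anchors for one sliding position and multiplies by the grid size. The function name `make_anchors`, the base size of 16 and the 60 x 40 grid are assumptions chosen for the example (they give box areas of 128², 256² and 512² and $W \times H = 2{,}400$); this is not Detectron's `generate_anchors`.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) anchors as (x1, y1, x2, y2), all centred on one cell."""
    ctr = (base_size - 1) / 2.0
    anchors = []
    for ratio in ratios:                 # ratio is height / width
        for scale in scales:
            area = float(base_size * scale) ** 2
            w = np.sqrt(area / ratio)    # keep the area, change the shape
            h = w * ratio
            anchors.append([ctr - (w - 1) / 2, ctr - (h - 1) / 2,
                            ctr + (w - 1) / 2, ctr + (h - 1) / 2])
    return np.array(anchors)

anchors = make_anchors()
k = anchors.shape[0]          # 3 scales x 3 ratios
W, H = 60, 40                 # hypothetical feature-map size, W * H = 2400
total = W * H * k             # number of anchors over the whole map
```

Every anchor shares the same centre; only the width and height change, which is what lets a single $3 \times 3$ window refer to boxes of very different sizes.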

Translation invariance

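The anchors, and the functions that compute proposals relative to them, make the approach translation invariant. If an object in the image is translated, the proposal should translate as well, and the same function should be able to predict the proposal at either location. By comparison, the MultiBox method uses k-means to generate 800 anchors, which are not translation invariant. MultiBox therefore does not guarantee that the same proposal is generated when an object is translated.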

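Translation invariance also reduces the model size. MultiBox has a $(4+1)\times800$-dimensional fully connected output layer, whereas the RPN has a $(4+2)\times9$-dimensional convolutional output layer when $k = 9$. As a result, the RPN output layer has $2.8\times10^4$ parameters ($512\times(4+2)\times9$ for VGG-16), two orders of magnitude fewer than the MultiBox output layer with its $6.1\times10^6$ parameters ($1536\times(4+1)\times800$ for GoogLeNet).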

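If the feature projection layers are taken into account, the RPN proposal layers still have an order of magnitude fewer parameters than those of MultiBox. The parameter count of the RPN proposal layers is $3\times3\times512\times512+512\times6\times9=2.4\times10^6$; that of the MultiBox proposal layers is $7\times7\times(64+96+64+64)\times1536+1536\times5\times800=27\times10^6$. The RPN is therefore less at risk of overfitting on small datasets such as PASCAL VOC.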
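
The parameter counts quoted above are plain arithmetic and can be checked directly. The variable names below are only labels for this check.

```python
# Output layers only.
rpn_output = 512 * (4 + 2) * 9                 # VGG-16, k = 9
multibox_output = 1536 * (4 + 1) * 800         # GoogLeNet, 800 anchors

# Including the feature projection layers.
rpn_total = 3 * 3 * 512 * 512 + rpn_output
multibox_total = 7 * 7 * (64 + 96 + 64 + 64) * 1536 + multibox_output

output_ratio = multibox_output / rpn_output    # about two orders of magnitude
total_ratio = multibox_total / rpn_total       # about one order of magnitude
```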

Pyramid of anchors

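The diversity of object scales and aspect ratios is a difficulty in detection. The previously common approaches build image (feature) pyramids (DPM, OverFeat, SPPNet, Fast R-CNN) or filter pyramids (DPM), and the repeated computation involved is time-consuming. The RPN classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios at each location, which amounts to building a pyramid of anchors on a single-scale image. In the RPN only the final reg and cls layers grow linearly with $k$, which is more efficient than the other two approaches.
[Figure: pyramids]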

Loss function

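To train the RPN, a binary class label (object or not) is assigned to each anchor. A positive label is assigned to two kinds of anchors:

(i) the anchor or anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box;
(ii) anchors whose IoU with any ground-truth box is higher than 0.7.

Note that a single ground-truth box may assign positive labels to multiple anchors.

The second condition is usually sufficient to determine the positive samples, but the first condition is still adopted because in rare cases the second condition finds no positive sample.

A negative label is assigned to a non-positive anchor if its IoU with all ground-truth boxes is lower than 0.3. Anchors that are neither positive nor negative do not contribute to the training objective.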
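
The labeling rule can be sketched in NumPy as follows. This is a simplified illustration, not Detectron's implementation: `iou_matrix` and `label_anchors` are hypothetical helpers, boxes are treated as continuous (x1, y1, x2, y2) rectangles without the +1 pixel convention, and -1 marks anchors that are ignored during training.

```python
import numpy as np

def iou_matrix(anchors, gt):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt, pos_thresh=0.7, neg_thresh=0.3):
    """Return labels per anchor: 1 positive, 0 negative, -1 ignored."""
    iou = iou_matrix(anchors, gt)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    max_per_anchor = iou.max(axis=1)
    labels[max_per_anchor < neg_thresh] = 0
    # Condition (i): best anchor(s) for each ground-truth box.
    best_per_gt = iou.max(axis=0)
    hit = (iou == best_per_gt[None, :]) & (best_per_gt[None, :] > 0)
    labels[hit.any(axis=1)] = 1
    # Condition (ii): IoU above the positive threshold.
    labels[max_per_anchor > pos_thresh] = 1
    return labels

anchors = np.array([[0, 0, 10, 10], [0, 0, 9, 10], [2, 0, 12, 10],
                    [50, 50, 60, 60], [100, 100, 110, 110]], dtype=float)
gt = np.array([[0, 0, 10, 10], [48, 48, 60, 60]], dtype=float)
labels = label_anchors(anchors, gt)
```

In this toy example the fourth anchor overlaps the second ground-truth box with an IoU of about 0.69, below 0.7, and becomes positive only through condition (i); the third anchor (IoU about 0.67) is ignored.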

With these definitions, the objective function follows the multi-task loss in Fast R-CNN. The loss function for an image is defined as:

$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p^{*}_i) \\ + \lambda\frac{1}{N_{reg}}\sum_i p^{*}_i L_{reg}(t_i, t^{*}_i).$

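Here, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability that anchor $i$ is an object. The ground-truth label $p^{\ast}_i$ is 1 if the anchor is positive and 0 if it is negative. $t_i$ is a vector of the 4 parameterized coordinates of the predicted bounding box, and $t^{\ast}_i$ is that of the ground-truth box associated with a positive anchor. The classification loss $L_{cls}$ is the log loss over two classes (object vs. not object). For the regression loss, $L_{reg}(t_i, t^{\ast}_i)=R(t_i - t^{\ast}_i)$, where $R$ is the robust loss function (smooth L$_1$) defined in Fast R-CNN. The term $p^{\ast}_i L_{reg}$ means that the regression loss is activated only for positive anchors ($p^{\ast}_i=1$) and disabled otherwise ($p^{\ast}_i=0$). The outputs of the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$ respectively.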

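The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$. In the paper's implementation (as in the released code), the cls term is normalized by the mini-batch size (i.e., $N_{cls}=256$) and the reg term by the number of anchor locations (i.e., $N_{reg}\sim2,400$). By default $\lambda=10$, so the cls and reg terms are roughly equally weighted. The paper shows experimentally that the results are insensitive to the value of $\lambda$ over a wide range, and also notes that the normalization above is not required and could be simplified.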
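
A minimal NumPy sketch of this loss, under a few assumptions: `p` already holds object probabilities, the classification term is written as binary cross-entropy, and `smooth_l1` uses the Fast R-CNN definition with a transition at 1. The function names and the toy inputs are illustrative only.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 as in Fast R-CNN: quadratic near zero, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    eps = 1e-12                                   # avoid log(0)
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)     # sum over the 4 coordinates
    # p_star masks the regression term so only positive anchors contribute.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])                     # one positive, one negative
t = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
t_star = np.zeros((2, 4))
loss = rpn_loss(p, p_star, t, t_star)
```

The negative anchor has a large regression error but contributes nothing to the second term. With the default values, $\lambda / N_{reg} = 10 / 2400 \approx 1/240$, which is close to $1 / N_{cls} = 1/256$; this is the sense in which the two terms are roughly equally weighted.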

For bounding-box regression, Faster R-CNN follows the 4-coordinate parameterization of R-CNN:

$\begin{aligned} t_{\textrm{x}} &= (x - x_{\textrm{a}})/w_{\textrm{a}},\quad t_{\textrm{y}} = (y - y_{\textrm{a}})/h_{\textrm{a}},\\ t_{\textrm{w}} &= \log(w / w_{\textrm{a}}), \quad \enspace t_{\textrm{h}} = \log(h / h_{\textrm{a}}),\\ t^{*}_{\textrm{x}} &= (x^{*} - x_{\textrm{a}})/w_{\textrm{a}},\quad t^{*}_{\textrm{y}} = (y^{*} - y_{\textrm{a}})/h_{\textrm{a}},\\ t^{*}_{\textrm{w}} &= \log(w^{*} / w_{\textrm{a}}),\quad \enspace t^{*}_{\textrm{h}} = \log(h^{*} / h_{\textrm{a}}), \end{aligned}$

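where $x$, $y$, $w$ and $h$ denote the box's center coordinates and its width and height. The variables $x$, $x_{\textrm{a}}$ and $x^{\ast}$ refer to the predicted box, the anchor box and the ground-truth box respectively (likewise for $y, w, h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.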
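
The parameterization is easy to check numerically. The sketch below encodes a ground-truth box relative to an anchor and decodes it again; `encode`, `decode` and the box values are made up for illustration, and boxes are given in centre form (x, y, w, h).

```python
import numpy as np

def encode(box, anchor):
    """(x, y, w, h) box relative to an (x, y, w, h) anchor -> (tx, ty, tw, th)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert encode: apply (tx, ty, tw, th) to the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])

anchor = (100.0, 100.0, 128.0, 128.0)
gt_box = (116.0, 92.0, 256.0, 64.0)
t_star = encode(gt_box, anchor)       # regression target for this anchor
recovered = decode(t_star, anchor)    # should reproduce gt_box
```

The centre offsets are normalized by the anchor size and the size ratios live in log space, so the targets stay in a similar numeric range regardless of the anchor's scale.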

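However, the RPN regresses bounding boxes differently from previous RoI-based (Region of Interest) methods (SPPNet, Fast R-CNN). SPPNet and Fast R-CNN perform bounding-box regression on features pooled from RoIs of arbitrary size, and the regression weights are shared across all region sizes. In the RPN, the features used for regression all have the same spatial size ($3 \times 3$) on the feature map. To account for varying sizes, a set of $k$ bounding-box regressors is learned. Each regressor is responsible for one scale and one aspect ratio, and the $k$ regressors do not share weights. Thanks to the anchor design, boxes of various sizes can therefore still be predicted even though the features have a fixed size and scale.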

Training the RPN

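The RPN can be trained end to end by backpropagation and stochastic gradient descent (SGD). Training follows the "image-centric" sampling strategy of Fast R-CNN. Each mini-batch comes from a single image and contains positive and negative anchor samples. It is possible to optimize the loss over all anchors, but this would bias towards negative samples because they dominate. Instead, 256 anchors are randomly sampled in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If an image has fewer than 128 positive samples, the mini-batch is padded with negative ones. (This is one of the points MegDet focuses on.)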
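
The sampling step might look like the sketch below. `sample_anchors` is a hypothetical helper, labels follow the convention 1 = positive, 0 = negative, -1 = ignored, and the label array is synthetic.

```python
import numpy as np

def sample_anchors(labels, batch_size=256, max_pos_fraction=0.5, seed=0):
    """Pick up to batch_size anchors, at most half of them positive."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    num_pos = min(len(pos), int(batch_size * max_pos_fraction))
    # Negatives fill whatever the positives leave free.
    num_neg = min(len(neg), batch_size - num_pos)
    pos = rng.choice(pos, size=num_pos, replace=False)
    neg = rng.choice(neg, size=num_neg, replace=False)
    return pos, neg

labels = np.zeros(2400 * 9, dtype=np.int64)   # mostly negatives
labels[:40] = 1                               # only 40 positives
labels[40:100] = -1                           # some ignored anchors
pos, neg = sample_anchors(labels)
```

With only 40 positives available, the mini-batch ends up with 40 positives and 216 negatives, which is the padding behaviour described above.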

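This completes the paper's description of the RPN, but some matters are left untouched, an important one being NMS (non-maximum suppression). This is where detection differs from classification: it involves many operations that are not standard in CNNs. Because the defined anchor boxes overlap one another, a single object produces multiple proposals. NMS takes the list of proposals sorted by score and iterates over it, discarding any proposal whose IoU with a higher-scoring kept proposal exceeds a predefined threshold, and finally retains the top k remaining proposals. The detailed procedure is covered together with the code.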
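
Before looking at the real implementations, here is a compact NumPy sketch of greedy NMS. It uses the same +1 pixel convention for widths and heights as the code discussed later, but the function signature and the toy boxes are assumptions made for this example.

```python
import numpy as np

def nms(boxes, scores, thresh=0.7, top_k=None):
    """Greedy NMS over (M, 4) boxes in (x1, y1, x2, y2); returns kept indices."""
    areas = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    order = np.argsort(-scores)                   # highest score first
    keep = []
    while order.size > 0 and (top_k is None or len(keep) < top_k):
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]               # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 99, 99], [5, 5, 104, 104], [200, 200, 299, 299]],
                 dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
kept_top1 = nms(boxes, scores, top_k=1)
```

The first two boxes overlap with an IoU of about 0.82, so the lower-scoring one is suppressed while the distant third box survives.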

add_generic_rpn_outputs

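In Detectron's rpn_heads.py, add_generic_rpn_outputs adds the RPN outputs (objectness classification and bounding-box regression) to the model.
It abstracts away whether FPN is used.

If FPN is used, it delegates to the FPN module.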

    loss_gradients = None
    if cfg.FPN.FPN_ON:
        # Delegate to the FPN module
        FPN.add_fpn_rpn_outputs(model, blob_in, dim_in, spatial_scale_in)
        if cfg.MODEL.FASTER_RCNN:
            # CollectAndDistributeFpnRpnProposals also labels proposals when in
            # training mode
            model.CollectAndDistributeFpnRpnProposals()
        if model.train:
            loss_gradients = FPN.add_fpn_rpn_losses(model)

Otherwise, it adds single-scale outputs.

    else:
        # Not using FPN, add RPN to a single scale
        add_single_scale_rpn_outputs(model, blob_in, dim_in, spatial_scale_in)
        if model.train:
            loss_gradients = add_single_scale_rpn_losses(model)
    return loss_gradients

add_single_scale_rpn_outputs

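generate_anchors produces a matrix of anchor boxes in (x1, y1, x2, y2) format. The anchors are centered on stride / 2 and have (approximately) square-root areas of the specified sizes and the given aspect ratios.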

    anchors = generate_anchors(
        stride=1. / spatial_scale,
        sizes=cfg.RPN.SIZES,
        aspect_ratios=cfg.RPN.ASPECT_RATIOS
    )
    num_anchors = anchors.shape[0]
    dim_out = dim_in

Add a Conv followed by a Relu.

    # RPN hidden representation
    model.Conv(
        blob_in,
        'conv_rpn',
        dim_in,
        dim_out,
        kernel=3,
        pad=1,
        stride=1,
        weight_init=gauss_fill(0.01),
        bias_init=const_fill(0.0)
    )
    model.Relu('conv_rpn', 'conv_rpn')

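A 1x1 convolution classifies the proposals. Note that this layer outputs num_anchors channels (one logit per anchor, later passed through a sigmoid) rather than the $2k$ scores described in the paper.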

    # Proposal classification scores
    model.Conv(
        'conv_rpn',
        'rpn_cls_logits',
        dim_in,
        num_anchors,
        kernel=1,
        pad=0,
        stride=1,
        weight_init=gauss_fill(0.01),
        bias_init=const_fill(0.0)
    )

A 1x1 convolution regresses the proposal box coordinates.

    # Proposal bbox regression deltas
    model.Conv(
        'conv_rpn',
        'rpn_bbox_pred',
        dim_in,
        4 * num_anchors,
        kernel=1,
        pad=0,
        stride=1,
        weight_init=gauss_fill(0.01),
        bias_init=const_fill(0.0)
    )

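Proposals are needed in Faster R-CNN mode and during RPN inference, but not for RPN-only training.
A Sigmoid turns the logits into scores, and the proposals are then generated.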

    if not model.train or cfg.MODEL.FASTER_RCNN:
        # Proposals are needed during:
        #  1) inference (== not model.train) for RPN only and Faster R-CNN
        #  OR
        #  2) training for Faster R-CNN
        # Otherwise (== training for RPN only), proposals are not needed
        model.net.Sigmoid('rpn_cls_logits', 'rpn_cls_probs')
        model.GenerateProposals(
            ['rpn_cls_probs', 'rpn_bbox_pred', 'im_info'],
            ['rpn_rois', 'rpn_roi_probs'],
            anchors=anchors,
            spatial_scale=spatial_scale
        )

In Faster R-CNN mode, training requires labels to be generated for the proposals, while inference simply renames the blob.

    if cfg.MODEL.FASTER_RCNN:
        if model.train:
            # Add op that generates training labels for in-network RPN proposals
            model.GenerateProposalLabels(['rpn_rois', 'roidb', 'im_info'])
        else:
            # Alias rois to rpn_rois for inference
            model.net.Alias('rpn_rois', 'rois')

add_single_scale_rpn_losses

Spatially narrow the full-sized RPN label arrays to match the feature map shape. _get_rpn_blobs assigns rpn_labels_int32_wide.

    # Spatially narrow the full-sized RPN label arrays to match the feature map
    # shape
    model.net.SpatialNarrowAs(
        ['rpn_labels_int32_wide', 'rpn_cls_logits'], 'rpn_labels_int32'
    )

The resulting rpn_bbox_targets, rpn_bbox_inside_weights and rpn_bbox_outside_weights are used by SmoothL1Loss.

    for key in ('targets', 'inside_weights', 'outside_weights'):
        model.net.SpatialNarrowAs(
            ['rpn_bbox_' + key + '_wide', 'rpn_bbox_pred'], 'rpn_bbox_' + key
        )

    loss_rpn_cls = model.net.SigmoidCrossEntropyLoss(
        ['rpn_cls_logits', 'rpn_labels_int32'],
        'loss_rpn_cls',
        scale=model.GetLossScale()
    )

    loss_rpn_bbox = model.net.SmoothL1Loss(
        [
            'rpn_bbox_pred', 'rpn_bbox_targets', 'rpn_bbox_inside_weights',
            'rpn_bbox_outside_weights'
        ],
        'loss_rpn_bbox',
        beta=1. / 9.,
        scale=model.GetLossScale()
    )
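
The beta=1. / 9. argument moves the point where smooth L1 switches from quadratic to linear. The sketch below shows the usual beta-parameterized form together with inside and outside weights. It is an assumption about the general shape of the computation, not a transcription of the Caffe2 operator: the final normalization and the scale factor are left out, and the function names are hypothetical.

```python
import numpy as np

def smooth_l1(x, beta=1.0 / 9.0):
    """0.5 * x^2 / beta below beta, |x| - 0.5 * beta above it."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def weighted_smooth_l1(pred, target, inside_w, outside_w, beta=1.0 / 9.0):
    # inside_w selects which coordinates count (zero for non-positive anchors);
    # outside_w rescales each selected term.
    diff = inside_w * (pred - target)
    return float((outside_w * smooth_l1(diff, beta)).sum())

beta = 1.0 / 9.0
values = smooth_l1(np.array([0.0, beta, 1.0]))

pred = np.array([[0.5, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]])
target = np.zeros((2, 4))
inside_w = np.array([[1.0] * 4, [0.0] * 4])     # second anchor is masked out
outside_w = np.full((2, 4), 0.5)
loss = weighted_smooth_l1(pred, target, inside_w, outside_w)
```

The two branches meet at |x| = beta, where both give 0.5 * beta, so the function stays continuous. A smaller beta makes the loss behave like plain L1 over a wider range of errors.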

get_loss_gradients generates a gradient of 1 for each loss specified in loss_blobs.
AddLosses adds the losses to the model's list of losses.

    loss_gradients = blob_utils.get_loss_gradients(
        model, [loss_rpn_cls, loss_rpn_bbox]
    )
    model.AddLosses(['loss_rpn_cls', 'loss_rpn_bbox'])
    return loss_gradients

GenerateProposals

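blobs_in:

rpn_cls_probs: 4D tensor of shape (N, A, H, W), where N is the number of images in the minibatch, A is the number of anchors per location, and (H, W) is the spatial size of the prediction grid. Each value is an estimated "object probability" in [0, 1].
rpn_bbox_pred: 4D tensor of shape (N, 4 * A, H, W) holding the predicted deltas that transform anchor boxes into RPN proposals.
im_info: 2D tensor of shape (N, 3) whose three columns encode [height, width, scale] of the input image. Height and width refer to the network input rather than the original image; scale is the factor used to resize the original image to the network input size.

blobs_out:

rpn_rois: 2D tensor of shape (R, 5) for R RPN proposals, whose five columns encode [batch ind, x1, y1, x2, y2]. The boxes are relative to the network input, which is a scaled version of the original image; the proposals must be scaled by 1 / scale (scale comes from im_info, see above) to convert them back to the original image coordinate system.
rpn_roi_probs: 1D tensor of object probability scores (extracted from rpn_cls_probs; see above).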

GenerateProposals

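What is net.Python? Judging from the call below, it wraps a Python callable (here the forward method of GenerateProposalsOp) as an operator in the Caffe2 net.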

        name = 'GenerateProposalsOp:' + ','.join([str(b) for b in blobs_in])
        # spatial_scale passed to the Python op is only used in convert_pkl_to_pb
        self.net.Python(
            GenerateProposalsOp(anchors, spatial_scale, self.train).forward
        )(blobs_in, blobs_out, name=name, spatial_scale=spatial_scale)
        return blobs_out

GenerateProposalsOp

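Outputs object detection proposals by applying the estimated bounding-box transformations to a set of regular boxes (called "anchors").
train should be a boolean.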

    def __init__(self, anchors, spatial_scale, train):
        self._anchors = anchors
        self._num_anchors = self._anchors.shape[0]
        self._feat_stride = 1. / spatial_scale
        self._train = train

forward

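1. For each location i in the (H, W) grid:
   a. generate the anchor boxes centered on cell i;
   b. apply the predicted bbox deltas to each anchor at cell i.
2. Clip the predicted boxes to the image.
3. Remove predicted boxes whose height or width is below the threshold.
4. Sort all (proposal, score) pairs by score from highest to lowest.
5. Take the top pre_nms_topN proposals before NMS.
6. Apply NMS with a loose threshold (0.7) to the remaining proposals.
7. Take the top after_nms_topN proposals after NMS.
8. Return the top proposals.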

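The first input holds the predicted scores, the second the predicted anchor transformations, and the third the image information (height, width, scale).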

        # predicted probability of fg object for each RPN anchor
        scores = inputs[0].data
        # predicted achors transformations
        bbox_deltas = inputs[1].data
        # input image (height, width, scale), in which scale is the scale factor
        # applied to the original dataset image to get the network input image
        im_info = inputs[2].data

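Build the offset of every point of the (H, W) grid, scaled by the feature stride into input-image coordinates. numpy.arange uses a half-open interval, closed on the left and open on the right.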

        # 1. Generate proposals from bbox deltas and shifted anchors
        height, width = scores.shape[-2:]
        # Enumerate all shifted positions on the (H, W) grid
        shift_x = np.arange(0, width) * self._feat_stride
        shift_y = np.arange(0, height) * self._feat_stride

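Generate the grid coordinates. The names shift_x and shift_y are reused for the results; this is plain rebinding of the names, while copy=False only lets np.meshgrid return views of the inputs instead of copies.
shifts holds the coordinate offset applied to the anchors at each grid location.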

        shift_x, shift_y = np.meshgrid(shift_x, shift_y, copy=False)
        # Convert to (K, 4), K=H*W, where the columns are (dx, dy, dx, dy)
        # shift pointing to each grid location
        shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                            shift_x.ravel(), shift_y.ravel())).transpose()

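Broadcast the anchors over shifts to obtain all anchors at all positions.
In the (H, W) grid:
- add A anchors of shape (1, A, 4) to K shifts of shape (K, 1, 4) to get all shifted anchors of shape (K, A, 4), then reshape them to (K*A, 4).
Why inputs[0] rather than inputs[2]? inputs[0] holds the scores with shape (N, A, H, W), so its first dimension is the number of images; im_info in inputs[2] has shape (N, 3), so its first dimension would give the same value.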

        # Broacast anchors over shifts to enumerate all anchors at all positions
        # in the (H, W) grid:
        #   - add A anchors of shape (1, A, 4) to
        #   - K shifts of shape (K, 1, 4) to get
        #   - all shifted anchors of shape (K, A, 4)
        #   - reshape to (K*A, 4) shifted anchors
        num_images = inputs[0].shape[0]
        A = self._num_anchors
        K = shifts.shape[0]
        all_anchors = self._anchors[np.newaxis, :, :] + shifts[:, np.newaxis, :]
        all_anchors = all_anchors.reshape((K * A, 4))
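
The broadcasting step is easier to follow on a tiny grid. The sketch below repeats the same NumPy operations on a 2 x 3 feature map with two made-up anchors (A = 2); the anchor values and the stride are assumptions for the example.

```python
import numpy as np

feat_stride = 16.0
height, width = 2, 3
# Two hypothetical base anchors in (x1, y1, x2, y2) form.
anchors = np.array([[-8.0, -8.0, 23.0, 23.0],
                    [-24.0, -24.0, 39.0, 39.0]])

shift_x = np.arange(0, width) * feat_stride     # [0, 16, 32]
shift_y = np.arange(0, height) * feat_stride    # [0, 16]
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
# (K, 4) with columns (dx, dy, dx, dy); x varies fastest.
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()

K, A = shifts.shape[0], anchors.shape[0]
# (1, A, 4) + (K, 1, 4) -> (K, A, 4) -> (K * A, 4)
all_anchors = (anchors[np.newaxis, :, :] + shifts[:, np.newaxis, :]).reshape(K * A, 4)
```

Rows are ordered by grid row, then grid column, then anchor index, so row 2 is the first anchor moved one cell to the right and row 6 is the first anchor moved one cell down.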

Call proposals_for_one_image to compute the boxes and scores for one image.

        rois = np.empty((0, 5), dtype=np.float32)
        roi_probs = np.empty((0, 1), dtype=np.float32)
        for im_i in range(num_images):
            im_i_boxes, im_i_probs = self.proposals_for_one_image(
                im_info[im_i, :], all_anchors, bbox_deltas[im_i, :, :, :],
                scores[im_i, :, :, :]
            )

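Why is the index dtype np.float32? Presumably because the batch index is stacked with the float box coordinates into a single (R, 5) array, which must have one dtype.
Append the results to rois and roi_probs.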

            batch_inds = im_i * np.ones(
                (im_i_boxes.shape[0], 1), dtype=np.float32
            )
            im_i_rois = np.hstack((batch_inds, im_i_boxes))
            rois = np.append(rois, im_i_rois, axis=0)
            roi_probs = np.append(roi_probs, im_i_probs, axis=0)

The first output is the RoIs and the second is roi_probs.

        outputs[0].reshape(rois.shape)
        outputs[0].data[...] = rois
        if len(outputs) > 1:
            outputs[1].reshape(roi_probs.shape)
            outputs[1].data[...] = roi_probs

proposals_for_one_image

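The file detectron/core/config.py specifies Detectron's default config options. Values in that file should not be changed; instead, write a config file (in yaml), load it with merge_cfg_from_file(yaml_file), and let it override the defaults.
Most tools in the tools directory take a --cfg option to specify an override file and an optional list of override key-value pairs:

See tools/{train,test}_net.py for example code that uses merge_cfg_from_file.
See configs//.yaml for example config files.
Detectron supports many different model types, each of which has many different options. The result is a large number of config options.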

    def proposals_for_one_image(
        self, im_info, all_anchors, bbox_deltas, scores
    ):
        # Get mode-dependent configuration
        cfg_key = 'TRAIN' if self._train else 'TEST'
        pre_nms_topN = cfg[cfg_key].RPN_PRE_NMS_TOP_N
        post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N
        nms_thresh = cfg[cfg_key].RPN_NMS_THRESH
        min_size = cfg[cfg_key].RPN_MIN_SIZE

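Transpose and reshape the predicted bbox transformations so that they are in the same order as the anchors:

the bbox deltas from the conv output are in (4 * A, H, W) format
transpose to (H, W, 4 * A)
reshape to (H * W * A, 4), where rows are ordered by (H, W, A) from slowest to fastest to match the enumerated anchors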

        # Transpose and reshape predicted bbox transformations to get them
        # into the same order as the anchors:
        #   - bbox deltas will be (4 * A, H, W) format from conv output
        #   - transpose to (H, W, 4 * A)
        #   - reshape to (H * W * A, 4) where rows are ordered by (H, W, A)
        #     in slowest to fastest order to match the enumerated anchors
        bbox_deltas = bbox_deltas.transpose((1, 2, 0)).reshape((-1, 4))

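The same applies to the scores:

the scores from the conv output are in (A, H, W) format
transpose to (H, W, A)
reshape to (H * W * A, 1), where rows are ordered by (H, W, A) to match the order of the anchors and bbox_deltas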

        # Same story for the scores:
        #   - scores are (A, H, W) format from conv output
        #   - transpose to (H, W, A)
        #   - reshape to (H * W * A, 1) where rows are ordered by (H, W, A)
        #     to match the order of anchors and bbox_deltas
        scores = scores.transpose((1, 2, 0)).reshape((-1, 1))
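
A small check makes the row ordering concrete. The sketch fills a (4 * A, H, W) array with distinct values and verifies that, after the transpose and reshape, row (h * W + w) * A + a contains the four deltas of anchor a at cell (h, w). The sizes are arbitrary.

```python
import numpy as np

A, H, W = 2, 2, 3
# Distinct values so every element can be traced back to its source.
bbox_deltas = np.arange(4 * A * H * W, dtype=np.float32).reshape(4 * A, H, W)

flat = bbox_deltas.transpose((1, 2, 0)).reshape((-1, 4))

h, w, a = 1, 2, 1
row = (h * W + w) * A + a                        # (H, W, A), slowest to fastest
expected = bbox_deltas[4 * a:4 * a + 4, h, w]    # channels of anchor a at (h, w)
```

This is the same ordering produced by broadcasting the anchors over the shifts, which is why the deltas and anchors can be combined row by row.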

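numpy.squeeze removes single-dimensional entries from the shape of an array.
numpy.argsort returns the indices that would sort an array.
numpy.argpartition performs an indirect partition along the given axis using the algorithm specified by the kind keyword. It returns an array of indices with the same shape as the input that index the data along the given axis in partitioned order.

If pre_nms_topN is non-positive or not smaller than the number of scores, everything is sorted directly; otherwise, to avoid sorting a large array, the pre_nms_topN highest scores are selected first and only those are sorted. The resulting indices are used to gather bbox_deltas, all_anchors and scores.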

        # 4. sort all (proposal, score) pairs by score from highest to lowest
        # 5. take top pre_nms_topN (e.g. 6000)
        if pre_nms_topN <= 0 or pre_nms_topN >= len(scores):
            order = np.argsort(-scores.squeeze())
        else:
            # Avoid sorting possibly large arrays; First partition to get top K
            # unsorted and then sort just those (~20x faster for 200k scores)
            inds = np.argpartition(
                -scores.squeeze(), pre_nms_topN
            )[:pre_nms_topN]
            order = np.argsort(-scores[inds].squeeze())
            order = inds[order]
        bbox_deltas = bbox_deltas[order, :]
        all_anchors = all_anchors[order, :]
        scores = scores[order]
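
The partition-then-sort trick can be compared against a full sort on random data. The array size and pre_nms_topN value below are arbitrary; the comparison is on score values rather than indices so that ties cannot affect it.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((1000, 1)).astype(np.float32)
pre_nms_topN = 50

# Reference: sort everything, then truncate.
full = np.argsort(-scores.squeeze())[:pre_nms_topN]

# Faster path: partition out the top K (unsorted), then sort only those.
inds = np.argpartition(-scores.squeeze(), pre_nms_topN)[:pre_nms_topN]
order = inds[np.argsort(-scores[inds].squeeze())]

top_full = scores[full].squeeze()
top_fast = scores[order].squeeze()
```

Both paths yield the same 50 highest scores in descending order; the second only has to sort 50 elements after a linear-time partition.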

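bbox_transform is the forward transform that maps proposal boxes to predicted boxes using bounding-box regression deltas. See bbox_transform_inv for a description of the weights argument.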

        # Transform anchors into proposals via bbox transformations
        proposals = box_utils.bbox_transform(
            all_anchors, bbox_deltas, (1.0, 1.0, 1.0, 1.0))

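2. Clip the predicted boxes to the image (this may produce proposals with zero area, which are removed in the next step).
clip_tiled_boxes clips the boxes to the image boundaries.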

        # 2. clip proposals to image (may result in proposals with zero area
        # that will be removed in the next step)
        proposals = box_utils.clip_tiled_boxes(proposals, im_info[:2])

3. Remove predicted boxes whose height or width is smaller than min_size.

        # 3. remove predicted boxes with either height or width < min_size
        keep = _filter_boxes(proposals, min_size, im_info)
        proposals = proposals[keep, :]
        scores = scores[keep]

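6. Apply a loose NMS (e.g. threshold = 0.7).
7. Take the top after_nms_topN proposals after NMS (e.g. 300).
8. Return the top proposals (-> RoIs top).
The function ultimately called is utils.cython_nms.nms.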

nms

        # 6. apply loose nms (e.g. threshold = 0.7)
        # 7. take after_nms_topN (e.g. 300)
        # 8. return the top proposals (-> RoIs top)
        if nms_thresh > 0:
            keep = box_utils.nms(np.hstack((proposals, scores)), nms_thresh)
            if post_nms_topN > 0:
                keep = keep[:post_nms_topN]
            proposals = proposals[keep, :]
            scores = scores[keep]
        return proposals, scores

_filter_boxes

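numpy.where returns elements chosen from x or y depending on the condition; when called with a single argument, it returns the indices of the true entries of the array.
keep records the (row) indices of the boxes that satisfy the conditions.
im_info has the format (height, width, scale), which is why the code compares x_ctr with im_info[1] and y_ctr with im_info[0].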

    # Scale min_size to match image scale
    min_size *= im_info[2]
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    x_ctr = boxes[:, 0] + ws / 2.
    y_ctr = boxes[:, 1] + hs / 2.
    keep = np.where(
        (ws >= min_size) & (hs >= min_size) &
        (x_ctr < im_info[1]) & (y_ctr < im_info[0]))[0]
    return keep

GenerateProposalsOp

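This is the C++ implementation, Caffe2's GenerateProposalsOp. It generates candidate bounding boxes for Faster R-CNN. Proposals are produced for a list of images from the per-image scores, the bounding-box regression results deltas, and the predefined box shapes anchors. Greedy non-maximum suppression yields the final bounding boxes. Reference: detectron/lib/ops/generate_proposals.py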

template <class Context>
class GenerateProposalsOp final : public Operator<Context> {
 public:
  USE_OPERATOR_CONTEXT_FUNCTIONS;
  GenerateProposalsOp(const OperatorDef& operator_def, Workspace* ws)
      : Operator<Context>(operator_def, ws),
        spatial_scale_(
            OperatorBase::GetSingleArgument<float>("spatial_scale", 1.0 / 16)),
        feat_stride_(1.0 / spatial_scale_),
        rpn_pre_nms_topN_(
            OperatorBase::GetSingleArgument<int>("pre_nms_topN", 6000)),
        rpn_post_nms_topN_(
            OperatorBase::GetSingleArgument<int>("post_nms_topN", 300)),
        rpn_nms_thresh_(
            OperatorBase::GetSingleArgument<float>("nms_thresh", 0.7f)),
        rpn_min_size_(OperatorBase::GetSingleArgument<float>("min_size", 16)),
        correct_transform_coords_(OperatorBase::GetSingleArgument<bool>(
            "correct_transform_coords",
            false)) {}

  ~GenerateProposalsOp() {}

  bool RunOnDevice() override;

  // Generate bounding box proposals for a given image
  // im_info: [height, width, im_scale]
  // all_anchors: (H * W * A, 4)
  // bbox_deltas_tensor: (4 * A, H, W)
  // scores_tensor: (A, H, W)
  // out_boxes: (n, 5)
  // out_probs: n
  void ProposalsForOneImage(
      const Eigen::Array3f& im_info,
      const Eigen::Map<const ERMatXf>& all_anchors,
      const utils::ConstTensorView<float>& bbox_deltas_tensor,
      const utils::ConstTensorView<float>& scores_tensor,
      ERArrXXf* out_boxes,
      EArrXf* out_probs) const;

 protected:
  // spatial_scale_ must be declared before feat_stride_
  float spatial_scale_{1.0};
  float feat_stride_{1.0};

  // RPN_PRE_NMS_TOP_N
  int rpn_pre_nms_topN_{6000};
  // RPN_POST_NMS_TOP_N
  int rpn_post_nms_topN_{300};
  // RPN_NMS_THRESH
  float rpn_nms_thresh_{0.7};
  // RPN_MIN_SIZE
  float rpn_min_size_{16};
  // Correct bounding box transform coordates, see bbox_transform() in boxes.py
  // Set to true to match the detectron code, set to false for backward
  // compatibility
  bool correct_transform_coords_{false};
};

GenerateProposalsOp::RunOnDevice()

The four inputs are the scores, the box deltas, the image info, and the anchors.
The outputs are the RoIs and their probabilities.

  const auto& scores = Input(0);
  const auto& bbox_deltas = Input(1);
  const auto& im_info_tensor = Input(2);
  const auto& anchors = Input(3);
  auto* out_rois = Output(0);
  auto* out_rois_probs = Output(1);

Check scores and read its dimensions.

  CAFFE_ENFORCE_EQ(scores.ndim(), 4, scores.ndim());
  CAFFE_ENFORCE(scores.template IsType<float>(), scores.meta().name());
  const auto num_images = scores.dim(0);
  const auto A = scores.dim(1);
  const auto height = scores.dim(2);
  const auto width = scores.dim(3);
  const auto K = height * width;

bbox_deltas has shape (num_images, A * 4, H, W).

  // bbox_deltas: (num_images, A * 4, H, W)
  CAFFE_ENFORCE_EQ(
      bbox_deltas.dims(), (vector<TIndex>{num_images, 4 * A, height, width}));

anchors has shape (A, 4).

  // anchors: (A, 4)
  CAFFE_ENFORCE_EQ(anchors.dims(), (vector<TIndex>{A, 4}));
  CAFFE_ENFORCE(anchors.template IsType<float>(), anchors.meta().name());

Broadcast the anchors to every position.

  // Broadcast the anchors to all pixels
  auto all_anchors_vec =
      utils::ComputeAllAnchors(anchors, height, width, feat_stride_);
  Eigen::Map<const ERMatXf> all_anchors(all_anchors_vec.data(), K * A, 4);

Eigen::Map is a matrix or vector expression that maps an existing array of data.

  Eigen::Map<const ERArrXXf> im_info(
      im_info_tensor.data<float>(),
      im_info_tensor.dim(0),
      im_info_tensor.dim(1));

Set the shape of the outputs.

  const int roi_col_count = 5;
  out_rois->Resize(0, roi_col_count);
  out_rois_probs->Resize(0);

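The Array class provides general-purpose arrays, whereas the Matrix class is intended for linear algebra. Array also offers an easy way to perform coefficient-wise operations that may have no linear-algebra meaning, such as adding a constant to every coefficient or multiplying two arrays coefficient-wise.
For each image, take the corresponding image info, box deltas and scores.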

  std::vector<ERArrXXf> im_boxes(num_images);
  std::vector<EArrXf> im_probs(num_images);
  for (int i = 0; i < num_images; i++) {
    auto cur_im_info = im_info.row(i);
    auto cur_bbox_deltas = GetSubTensorView<float>(bbox_deltas, i);
    auto cur_scores = GetSubTensorView<float>(scores, i);

Call ProposalsForOneImage to obtain the predicted boxes and probabilities.

    ERArrXXf& im_i_boxes = im_boxes[i];
    EArrXf& im_i_probs = im_probs[i];
    ProposalsForOneImage(
        cur_im_info,
        all_anchors,
        cur_bbox_deltas,
        cur_scores,
        &im_i_boxes,
        &im_i_probs);
  }

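Compute the total number of RoIs.
Why use Extend here? The output tensors were first resized to zero rows, and Extend then grows their first dimension by roi_counts.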

  int roi_counts = 0;
  for (int i = 0; i < num_images; i++) {
    roi_counts += im_boxes[i].rows();
  }
  out_rois->Extend(roi_counts, 50, &context_);
  out_rois_probs->Extend(roi_counts, 50, &context_);
  float* out_rois_ptr = out_rois->mutable_data<float>();
  float* out_rois_probs_ptr = out_rois_probs->mutable_data<float>();
  for (int i = 0; i < num_images; i++) {
    const ERArrXXf& im_i_boxes = im_boxes[i];
    const EArrXf& im_i_probs = im_probs[i];
    int csz = im_i_boxes.rows();

Map the memory of out_rois onto cur_rois, set the image index column, and store the predicted box coordinates.

    // write rois
    Eigen::Map<ERArrXXf> cur_rois(out_rois_ptr, csz, 5);
    cur_rois.col(0).setConstant(i);
    cur_rois.block(0, 1, csz, 4) = im_i_boxes;

Store the predicted probabilities.

    // write rois_probs
    Eigen::Map<EArrXf>(out_rois_probs_ptr, csz) = im_i_probs;

    out_rois_ptr += csz * roi_col_count;
    out_rois_probs_ptr += csz;
  }

GetSubTensorView

Get a sub-tensor view from tensor using its data pointer.

  DCHECK_EQ(tensor.meta().itemsize(), sizeof(T));

  if (tensor.size() == 0) {
    return utils::ConstTensorView<T>(nullptr, {});
  }

  std::vector<int> start_dims(tensor.ndim(), 0);
  start_dims.at(0) = dim0_start_index;
  auto st_idx = ComputeStartIndex(tensor, start_dims);
  auto ptr = tensor.data<T>() + st_idx;

  auto& input_dims = tensor.dims();
  std::vector<int> ret_dims(input_dims.begin() + 1, input_dims.end());

  utils::ConstTensorView<T> ret(ptr, ret_dims);
  return ret;

GenerateProposalsOp::ProposalsForOneImage

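Transpose and reshape the predicted bbox transformations so that they are in the same order as the anchors:

the bbox deltas from the conv output are in (4 * A, H, W) format
transpose to (H, W, 4 * A)
reshape to (H * W * A, 4), where rows are ordered by (H, W, A) from slowest to fastest to match the enumerated anchors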

  // Transpose and reshape predicted bbox transformations to get them
  // into the same order as the anchors:
  //   - bbox deltas will be (4 * A, H, W) format from conv output
  //   - transpose to (H, W, 4 * A)
  //   - reshape to (H * W * A, 4) where rows are ordered by (H, W, A)
  //     in slowest to fastest order to match the enumerated anchors
  CAFFE_ENFORCE_EQ(bbox_deltas_tensor.ndim(), 3);
  CAFFE_ENFORCE_EQ(bbox_deltas_tensor.dim(0) % 4, 0);
  auto A = bbox_deltas_tensor.dim(0) / 4;
  auto H = bbox_deltas_tensor.dim(1);
  auto W = bbox_deltas_tensor.dim(2);

Map the data of bbox_deltas_tensor with Eigen::Map and transpose it to obtain bbox_deltas.

  // equivalent to python code
  //  bbox_deltas = bbox_deltas.transpose((1, 2, 0)).reshape((-1, 4))
  ERArrXXf bbox_deltas(H * W * A, 4);
  Eigen::Map<ERMatXf>(bbox_deltas.data(), H * W, 4 * A) =
      Eigen::Map<const ERMatXf>(bbox_deltas_tensor.data(), A * 4, H * W)
          .transpose();
  CAFFE_ENFORCE_EQ(bbox_deltas.rows(), all_anchors.rows());

scores is handled in the same way.

  // - scores are (A, H, W) format from conv output
  // - transpose to (H, W, A)
  // - reshape to (H * W * A, 1) where rows are ordered by (H, W, A)
  //   to match the order of anchors and bbox_deltas
  CAFFE_ENFORCE_EQ(scores_tensor.ndim(), 3);
  CAFFE_ENFORCE_EQ(scores_tensor.dims(), (vector<int>{A, H, W}));
  // equivalent to python code
  // scores = scores.transpose((1, 2, 0)).reshape((-1, 1))
  EArrXf scores(scores_tensor.size());
  Eigen::Map<ERMatXf>(scores.data(), H * W, A) =
      Eigen::Map<const ERMatXf>(scores_tensor.data(), A, H * W).transpose();

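std::iota fills order with the indices of scores, which are then sorted by score in descending order. std::partial_sort sorts only the leading part of the range; it has been part of the standard library since C++98 (C++17 only added the execution-policy overloads).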

  std::vector<int> order(scores.size());
  std::iota(order.begin(), order.end(), 0);
  if (rpn_pre_nms_topN_ <= 0 || rpn_pre_nms_topN_ >= scores.size()) {
    // 4. sort all (proposal, score) pairs by score from highest to lowest
    // 5. take top pre_nms_topN (e.g. 6000)
    std::sort(order.begin(), order.end(), [&scores](int lhs, int rhs) {
      return scores[lhs] > scores[rhs];
    });
  } else {
    // Avoid sorting possibly large arrays; First partition to get top K
    // unsorted and then sort just those (~20x faster for 200k scores)
    std::partial_sort(
        order.begin(),
        order.begin() + rpn_pre_nms_topN_,
        order.end(),
        [&scores](int lhs, int rhs) { return scores[lhs] > scores[rhs]; });
    order.resize(rpn_pre_nms_topN_);
  }

GetSubArray gathers the elements in sorted order using the sorted indices.

  ERArrXXf bbox_deltas_sorted;
  ERArrXXf all_anchors_sorted;
  EArrXf scores_sorted;
  utils::GetSubArrayRows(
      bbox_deltas, utils::AsEArrXt(order), &bbox_deltas_sorted);
  utils::GetSubArrayRows(
      all_anchors.array(), utils::AsEArrXt(order), &all_anchors_sorted);
  utils::GetSubArray(scores, utils::AsEArrXt(order), &scores_sorted);

bbox_transform turns anchors into proposals via the bounding-box transformations.

  // Transform anchors into proposals via bbox transformations
  static const std::vector<float> bbox_weights{1.0, 1.0, 1.0, 1.0};
  auto proposals = utils::bbox_transform(
      all_anchors_sorted,
      bbox_deltas_sorted,
      bbox_weights,
      utils::BBOX_XFORM_CLIP_DEFAULT,
      correct_transform_coords_);

2. Clip the proposals to the image (this may produce proposals with zero area, which are removed in the next step).

  // 2. clip proposals to image (may result in proposals with zero area
  // that will be removed in the next step)
  proposals = utils::clip_boxes(proposals, im_info[0], im_info[1]);

3. Remove predicted boxes whose height or width is smaller than min_size.

  // 3. remove predicted boxes with either height or width < min_size
  auto keep = utils::filter_boxes(proposals, min_size, im_info);
  DCHECK_LE(keep.size(), scores_sorted.size());

6. Apply a loose NMS (e.g. threshold = 0.7).
7. Take the top after_nms_topN proposals (e.g. 300).
8. Return the top proposals (-> RoIs top).

  // 6. apply loose nms (e.g. threshold = 0.7)
  // 7. take after_nms_topN (e.g. 300)
  // 8. return the top proposals (-> RoIs top)
  if (post_nms_topN > 0 && post_nms_topN < keep.size()) {
    keep = utils::nms_cpu(
        proposals, scores_sorted, keep, nms_thresh, post_nms_topN);
  } else {
    keep = utils::nms_cpu(proposals, scores_sorted, keep, nms_thresh);
  }

Generate the outputs.

  // Generate outputs
  utils::GetSubArrayRows(proposals, utils::AsEArrXt(keep), out_boxes);
  utils::GetSubArray(scores_sorted, utils::AsEArrXt(keep), out_probs);

bbox_transform

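This is the C++ implementation of bbox_transform, which uses bounding-box regression deltas to convert proposal boxes into predicted boxes.

Forward transform that maps proposal boxes to predicted boxes using bounding-box regression deltas.
boxes: pixel coordinates of the bounding boxes, shape (M, 4), format [x1; y1; x2; y2], where x2 >= x1 and y2 >= y1
deltas: bounding-box translations and scales, shape (M, 4), format [dx; dy; dw; dh]
dx, dy: scale-invariant translation of the box center
dw, dh: log-space scaling of the box width and height
weights: weights [wx, wy, ww, wh] for the deltas
bbox_xform_clip: clip value for dw and dh in log space; the code applies it with cwiseMin, so it acts as an upper bound on the scaling
correct_transform_coords: correct the bounding-box transform coordinates. Set to true to match the Detectron code, set to false for backward compatibility
Return value: pixel coordinates of the bounding boxes, shape (M, 4), format [x1; y1; x2; y2]. For more details, see Appendix C of "Rich feature hierarchies for accurate object detection and semantic segmentation".
Reference: detectron/lib/utils/boxes.py bbox_transform()

If boxes is empty, return an empty matrix.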

  using T = typename Derived1::Scalar;
  using EArrXX = EArrXXt<T>;
  using EArrX = EArrXt<T>;

  if (boxes.rows() == 0) {
    return EArrXX::Zero(T(0), deltas.cols());
  }

Check the dimensions of boxes and deltas.

  CAFFE_ENFORCE_EQ(boxes.rows(), deltas.rows());
  CAFFE_ENFORCE_EQ(boxes.cols(), 4);
  CAFFE_ENFORCE_EQ(deltas.cols(), 4);

Compute the widths, heights and center coordinates of boxes.

  EArrX widths = boxes.col(2) - boxes.col(0) + T(1.0);
  EArrX heights = boxes.col(3) - boxes.col(1) + T(1.0);
  auto ctr_x = boxes.col(0) + T(0.5) * widths;
  auto ctr_y = boxes.col(1) + T(0.5) * heights;

cwiseMin takes the coefficient-wise minimum of two operands.

  auto dx = deltas.col(0).template cast<T>() / weights[0];
  auto dy = deltas.col(1).template cast<T>() / weights[1];
  auto dw =
      (deltas.col(2).template cast<T>() / weights[2]).cwiseMin(bbox_xform_clip);
  auto dh =
      (deltas.col(3).template cast<T>() / weights[3]).cwiseMin(bbox_xform_clip);

Compute the center coordinates, widths and heights of the predicted boxes.

  EArrX pred_ctr_x = dx * widths + ctr_x;
  EArrX pred_ctr_y = dy * heights + ctr_y;
  EArrX pred_w = dw.exp() * widths;
  EArrX pred_h = dh.exp() * heights;

Convert back to the corner-coordinate representation.

  T offset(correct_transform_coords ? 1.0 : 0.0);

  EArrXX pred_boxes = EArrXX::Zero(deltas.rows(), deltas.cols());
  // x1
  pred_boxes.col(0) = pred_ctr_x - T(0.5) * pred_w;
  // y1
  pred_boxes.col(1) = pred_ctr_y - T(0.5) * pred_h;
  // x2
  pred_boxes.col(2) = pred_ctr_x + T(0.5) * pred_w - offset;
  // y2
  pred_boxes.col(3) = pred_ctr_y + T(0.5) * pred_h - offset;

  return pred_boxes;
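
The effect of correct_transform_coords is easiest to see with zero deltas. Below is a NumPy re-implementation of the steps above written for this article; it is a sketch and not the Detectron function. The weights are omitted (they are all 1.0 in the RPN), and the default clip value of log(1000 / 16) is an assumption about BBOX_XFORM_CLIP_DEFAULT.

```python
import numpy as np

def bbox_transform(boxes, deltas, clip=np.log(1000.0 / 16.0), correct_coords=True):
    """Apply (dx, dy, dw, dh) deltas to (x1, y1, x2, y2) boxes."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dw = np.minimum(deltas[:, 2], clip)          # cap the log-space scaling
    dh = np.minimum(deltas[:, 3], clip)

    pred_ctr_x = deltas[:, 0] * widths + ctr_x
    pred_ctr_y = deltas[:, 1] * heights + ctr_y
    pred_w = np.exp(dw) * widths
    pred_h = np.exp(dh) * heights

    offset = 1.0 if correct_coords else 0.0      # undo the +1 used for widths
    return np.stack([pred_ctr_x - 0.5 * pred_w,
                     pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w - offset,
                     pred_ctr_y + 0.5 * pred_h - offset], axis=1)

boxes = np.array([[0.0, 0.0, 9.0, 9.0]])
zero = np.zeros((1, 4))
same = bbox_transform(boxes, zero, correct_coords=True)
legacy = bbox_transform(boxes, zero, correct_coords=False)
huge = bbox_transform(boxes, np.array([[0.0, 0.0, 100.0, 0.0]]))
```

With the offset, zero deltas reproduce the input box exactly; without it, x2 and y2 come out one pixel too large. The last call shows the clip at work: a dw of 100 is capped, so the predicted width is 62.5 times the original rather than exp(100) times.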

clip_boxes

  CAFFE_ENFORCE_EQ(boxes.cols(), 4);

  EArrXXt<typename Derived::Scalar> ret(boxes.rows(), boxes.cols());

  // x1 >= 0 && x1 < width
  ret.col(0) = boxes.col(0).cwiseMin(width - 1).cwiseMax(0);
  // y1 >= 0 && y1 < height
  ret.col(1) = boxes.col(1).cwiseMin(height - 1).cwiseMax(0);
  // x2 >= 0 && x2 < width
  ret.col(2) = boxes.col(2).cwiseMin(width - 1).cwiseMax(0);
  // y2 >= 0 && y2 < height
  ret.col(3) = boxes.col(3).cwiseMin(height - 1).cwiseMax(0);

  return ret;

filter_boxes

  CAFFE_ENFORCE_EQ(boxes.cols(), 4);

  // Scale min_size to match image scale
  min_size *= im_info[2];

  using T = typename Derived::Scalar;
  using EArrX = EArrXt<T>;

  EArrX ws = boxes.col(2) - boxes.col(0) + T(1);
  EArrX hs = boxes.col(3) - boxes.col(1) + T(1);
  EArrX x_ctr = boxes.col(0) + ws / T(2);
  EArrX y_ctr = boxes.col(1) + hs / T(2);

  EArrXb keep = (ws >= min_size) && (hs >= min_size) &&
      (x_ctr < T(im_info[1])) && (y_ctr < T(im_info[0]));

  return GetArrayIndices(keep);

nms_cpu

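Greedy non-maximum suppression over the candidate boxes. If the Intersection-over-Union (IoU) between boxes is greater than the threshold, the higher-scoring box is kept and the others are discarded.
Reference: Detectron/detectron/utils/cython_nms.pyx
proposals: pixel coordinates of the proposal boxes, shape (M, 4), format [x1; y1; x2; y2]
scores: score of each bounding box, shape (M, 1)
sorted_indices: indices that sort the scores from high to low
return: row indices of the selected proposals

Check the shapes of the inputs.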

  CAFFE_ENFORCE_EQ(proposals.rows(), scores.rows());
  CAFFE_ENFORCE_EQ(proposals.cols(), 4);
  CAFFE_ENFORCE_EQ(scores.cols(), 1);
  CAFFE_ENFORCE_LE(sorted_indices.size(), proposals.rows());

Take each column of proposals and compute the areas of the proposal boxes.

  using EArrX = EArrXt<typename Derived1::Scalar>;

  auto x1 = proposals.col(0);
  auto y1 = proposals.col(1);
  auto x2 = proposals.col(2);
  auto y2 = proposals.col(3);

  EArrX areas = (x2 - x1 + 1.0) * (y2 - y1 + 1.0);

AsEArrXt uses Eigen::Map to expose the data of a std::vector as an Eigen array.

  EArrXi order = AsEArrXt(sorted_indices);

It seems unnecessary to check topN >= 0 on every iteration. Could the default value be changed to std::numeric_limits<int>::max() instead?

  std::vector<int> keep;
  int ci = 0;
  while (order.size() > 0) {
    // exit if already enough proposals
    if (topN >= 0 && keep.size() >= topN) {
      break;
    }

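ConstEigenVectorArrayMap is a one-dimensional constant array map.
Take the first index and append it to keep.
xx1, yy1, xx2 and yy2 are the coordinates of the rectangle where each remaining box overlaps the highest-scoring box.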

    int i = order[0];
    keep.push_back(i);
    ConstEigenVectorArrayMap<int> rest_indices(
        order.data() + 1, order.size() - 1);
    EArrX xx1 = GetSubArray(x1, rest_indices).cwiseMax(x1[i]);
    EArrX yy1 = GetSubArray(y1, rest_indices).cwiseMax(y1[i]);
    EArrX xx2 = GetSubArray(x2, rest_indices).cwiseMin(x2[i]);
    EArrX yy2 = GetSubArray(y2, rest_indices).cwiseMin(y2[i]);

    EArrX w = (xx2 - xx1 + 1.0).cwiseMax(0.0);
    EArrX h = (yy2 - yy1 + 1.0).cwiseMax(0.0);
    EArrX inter = w * h;
    EArrX ovr = inter / (areas[i] + GetSubArray(areas, rest_indices) - inter);

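GetArrayIndices returns the indices of the elements of a 1-D array that evaluate to true.
The indices in inds must be incremented by 1 to map back into order. The filtered indices are assigned to order, and the next round of filtering begins.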

    // indices for sub array order[1:n]
    auto inds = GetArrayIndices(ovr <= thresh);
    order = GetSubArray(order, AsEArrXt(inds) + 1);
  }

  return keep;