libtorch Study Notes (16): How the Faster-RCNN RPN Is Trained and Which Parameters It Trains

王飞95

License: CC 4.0 BY-SA

Categories: libtorch, Notes. Tags: deep learning, neural networks, probability theory

Article link: https://blog.csdn.net/defi_wang/article/details/108558141

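This article explains the principle and implementation details of the RPN in Faster-RCNN, including foreground/background classification and bounding box regression.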

Study Summary

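After the earlier notes in this series, most of the basic concepts are reasonably clear. Using standard C/C++ functions and the D2D technology that ships with Windows, I implemented loading of the MNIST, CIFAR10, CIFAR100 and Image Folder datasets, plus image-to-tensor conversion (with the common transforms such as Center Crop, Random Crop, Flip Horizontal, Padding Scale, etc.). My own network loader can define and load networks such as VGG and RESNET, and after training on MNIST, CIFAR and a cats-vs-dogs training set the accuracy is acceptable: VGG16 reaches 85+% on the cats-vs-dogs set and RESNET34 reaches 94+% on CIFAR10. This largely confirms that the earlier implementation has no major problems.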

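To use these pieces in the real world, however, there is still a lot of work to do. In computer vision, detecting objects in an image (people, animals, physical objects, text, and so on) is foundational. Faster-RCNN is one of the more popular object detection methods today. Plenty of articles already cover it, so this note explains just one of its details.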

Faster-RCNN

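[Figure: Faster-RCNN network structure (image not included)]
In this figure the conv layers are easy to understand: they are the VGG16, ResNet, etc. mentioned earlier, usually called the backbone. This network is generally trained separately, and a pretrained network is used here. The RPN is harder to understand. Many articles and videos do not really spell it out, and it took me a while to get it.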

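The conv layers apply already-trained parameters to convolve the input image. The resulting feature maps contain the information needed for accurate classification: take the data of one region, feed it into fully connected layers, and any trained object class present in that region can be recognized. The RPN builds on this with a neat design, training further on these features to extract the information it wants. This is the internal structure of the RPN:
[Figure: internal structure of the RPN (image not included)]
At first glance this figure is confusing. Other articles describe it in detail, but I could not work out what exactly is being trained or how the loss function is defined.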

Foreground and Background Classification Network

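Look at the upper branch first. As other articles describe, it is a network that produces foreground and background scores (or probabilities) for each anchor. conv3/512 is followed by conv1/18 (2 scores x 9 anchor areas = 18), and training this branch mainly means training the parameters of the 18 convolution kernels in conv1/18. Once trained, the branch can compute the foreground and background scores of a given anchor from the feature map. Why does this work? The feature map already carries detailed foreground and background feature information, since it is computed by the pretrained conv layers. With the ground_truth labels, the kernel parameters can be trained to extract the foreground and background scores. Softmax then turns the scores into probabilities, from which the required foreground and background anchors are picked.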

Data flow:
$\begin{aligned} &[batch\_size, 18, H, W] \to reshape\_before\_softmax \to [batch\_size, 2, H*9, W] \\ &\to softmax([batch\_size, 2, H*9, W], 1) \\ &\to [batch\_size, 1, H*9, W]\ (\text{probability}) \,/\, [batch\_size, 1, H*9, W]\ (\text{background or foreground}) \\ &\to Reshape\_after\_Softmax \\ &\to [batch\_size, 9, H, W]\ (\text{probability}) \,/\, [batch\_size, 9, H, W]\ (\text{background or foreground}) \end{aligned}$
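Taking the maximum of the Softmax output along dimension 1 gives a tuple: the first element is the probability of the winning class (background or foreground), and the second is the class itself (for example 0: background, 1: foreground). With sigmoid, one gets a single binary probability directly, for example >0.5 is foreground and <=0.5 is background. Either way, the final result is a classification (foreground/background) for every anchor.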
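The per-anchor classification described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `softmax2` and `classify_anchor` and the raw scores are hypothetical, and a real RPN applies the same arithmetic to whole tensors at once.

```python
import math


def softmax2(bg_score, fg_score):
    """Numerically stable softmax over one (background, foreground) score pair."""
    m = max(bg_score, fg_score)
    e_bg = math.exp(bg_score - m)
    e_fg = math.exp(fg_score - m)
    total = e_bg + e_fg
    return e_bg / total, e_fg / total


def classify_anchor(bg_score, fg_score):
    """Return (probability of the winning class, label); 0 = background, 1 = foreground."""
    p_bg, p_fg = softmax2(bg_score, fg_score)
    # A tie (p_fg == 0.5) falls to background, matching the "<=0.5 is background" rule.
    return (p_fg, 1) if p_fg > p_bg else (p_bg, 0)


# Hypothetical raw scores (background, foreground) for three anchors.
raw_scores = [(2.0, -1.0), (0.1, 0.3), (-0.5, 1.5)]
results = [classify_anchor(bg, fg) for bg, fg in raw_scores]
labels = [label for _, label in results]
```

A softmax over two scores equals a sigmoid of their difference, which is why the sigmoid variant needs only one output per anchor instead of two.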

Label/Ground_Truth used for training:
This should not be hard to understand. In a training set such as VOC, the xml records the bounding-box position of every object:

<object>
    <name>dog</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox>
        <xmin>48</xmin>
        <ymin>240</ymin>
        <xmax>195</xmax>
        <ymax>371</ymax>
    </bndbox>
</object>

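From this one can derive whether each anchor is foreground or background, apply a cross-entropy loss to the SoftMax output, and backpropagate to update the conv1/18 weights and biases. The RPN Conv $_{1/18}$ trained in this way can then distinguish foreground from background through SoftMax.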
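A minimal sketch of how such labels could be derived from the VOC annotation. The article does not state the labeling rule; the IoU thresholds 0.7 and 0.3 used here are an assumption (a common choice for RPN training), and the anchors, `parse_bndbox`, `label_anchor` and `cross_entropy` are hypothetical names for illustration.

```python
import math
import xml.etree.ElementTree as ET

VOC_OBJECT = """<object>
    <name>dog</name>
    <bndbox>
        <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
    </bndbox>
</object>"""


def parse_bndbox(xml_text):
    """Read (xmin, ymin, xmax, ymax) from one VOC <object> element."""
    box = ET.fromstring(xml_text).find("bndbox")
    return tuple(float(box.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax"))


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt, fg_thresh=0.7, bg_thresh=0.3):
    """1 = foreground, 0 = background, -1 = ignored (assumed thresholds)."""
    overlap = iou(anchor, gt)
    if overlap >= fg_thresh:
        return 1
    if overlap < bg_thresh:
        return 0
    return -1


def cross_entropy(p_fg, label):
    """Cross-entropy of one anchor given its predicted foreground probability."""
    return -math.log(p_fg if label == 1 else 1.0 - p_fg)


gt = parse_bndbox(VOC_OBJECT)
# Hypothetical anchors: one tight on the dog, one far away, one half overlapping.
anchors = [(50, 240, 200, 370), (400, 100, 528, 228), (100, 240, 250, 371)]
labels = [label_anchor(a, gt) for a in anchors]
```

Anchors labeled -1 would simply be left out of the loss; only the 0 and 1 anchors contribute a cross-entropy term that is backpropagated.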

Bounding Box Adjustment

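Now look at the lower branch in the figure above. It performs bounding box regression, the first adjustment of the anchor boxes. Put plainly, it uses a convolutional network to learn where an object is and how large it is. This is the surprising part: convolutional networks are usually trained for classification, yet here one is trained to predict the position and size of an object. It becomes easier to understand if you treat every position-and-size to be learned as a class. This passage from R-CNN helps: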

This is similar in spirit to the bounding-box regression used in deformable part models [17]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.

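But there are far too many combinations of position (for example ([0, 800), [0, 600)), in practice somewhat smaller because positions are tied to anchor boxes) and size (for example ([0-800], [0-600])). How many classes would that be? R-CNN therefore introduces a scale transformation to solve the problem of too many classes. First, the bbox center $(P_x, P_y)$ replaces the top-left corner, which halves the position range; the size becomes the distance from the center to the border, so its range is halved as well. That is still not enough, so what is actually learned is the set of transformation parameters that map an anchor_box close to the target bbox. For an 800x600 image, for instance, the worst case runs from (w: 800/2, h: 600/2) down to (w: 0, h: 0), giving $(d_w(P), d_h(P)) \to (\ln(400), \ln(300)) \to (5.991, 5.704)$. The range to be learned shrinks dramatically, so learning converges quickly.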
$\begin{aligned} G_{ground\_truth\ bounding\ box} &= (G_x, G_y, G_w, G_h)\\ P_{proposal\_box} &= (P_x, P_y, P_w, P_h) \\ \hat{G}_{predicted\ ground\_truth\ box} &= (\hat{G}_x, \hat{G}_y, \hat{G}_w, \hat{G}_h) \end{aligned}$
Here the proposal box can be understood as the selected anchor box. Conv $_{1/36}$ outputs $d_*(P)$ (where * is one of x, y, h, w):
$\begin{aligned} \hat{G}_x &= P_w d_x(P) + P_x\\ \hat{G}_y &= P_h d_y(P) + P_y\\ \hat{G}_w &= P_w \exp(d_w(P))\\ \hat{G}_h &= P_h \exp(d_h(P)) \end{aligned}$
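The bbox center gets a scale-invariant translation, which keeps the center as accurate as possible; the height and width get an exponential transform. As noted above, the benefit is fast convergence without a long training time. This is the corresponding code in torchvision: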

    def decode_single(self, rel_codes, boxes):
        """
        From a set of original boxes and encoded relative box offsets,
        get the decoded boxes.

        Arguments:
            rel_codes (Tensor): encoded boxes
            boxes (Tensor): reference boxes.
        """

        boxes = boxes.to(rel_codes.dtype)

        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        wx, wy, ww, wh = self.weights
        dx = rel_codes[:, 0::4] / wx
        dy = rel_codes[:, 1::4] / wy
        dw = rel_codes[:, 2::4] / ww
        dh = rel_codes[:, 3::4] / wh

        # Prevent sending too large values into torch.exp()
        dw = torch.clamp(dw, max=self.bbox_xform_clip)
        dh = torch.clamp(dh, max=self.bbox_xform_clip)

        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
        pred_w = torch.exp(dw) * widths[:, None]
        pred_h = torch.exp(dh) * heights[:, None]

        pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
        return pred_boxes

Here boxes are the anchor boxes generated in the RPN, and rel_codes is the output of this branch, Conv $_{1/36}$.
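To make the decoding step concrete, here is a single-box version in plain Python that mirrors the arithmetic of `decode_single`. It is a sketch under assumptions: the function name `decode_box` is hypothetical, the weights are taken as (1, 1, 1, 1), and the clip value `log(1000 / 16)` is an assumed stand-in for `bbox_xform_clip`.

```python
import math


def decode_box(anchor, deltas, weights=(1.0, 1.0, 1.0, 1.0),
               clip=math.log(1000.0 / 16)):
    """Apply (dx, dy, dw, dh) to one (x1, y1, x2, y2) anchor box."""
    x1, y1, x2, y2 = anchor
    pw, ph = x2 - x1, y2 - y1                # anchor width and height
    px, py = x1 + 0.5 * pw, y1 + 0.5 * ph    # anchor center

    dx, dy, dw, dh = (d / w for d, w in zip(deltas, weights))
    # Keep exp() from blowing up on very large predicted deltas.
    dw, dh = min(dw, clip), min(dh, clip)

    gx = dx * pw + px          # G^_x = P_w * d_x(P) + P_x
    gy = dy * ph + py          # G^_y = P_h * d_y(P) + P_y
    gw = math.exp(dw) * pw     # G^_w = P_w * exp(d_w(P))
    gh = math.exp(dh) * ph     # G^_h = P_h * exp(d_h(P))

    return (gx - 0.5 * gw, gy - 0.5 * gh, gx + 0.5 * gw, gy + 0.5 * gh)


anchor = (100.0, 100.0, 200.0, 300.0)
# Zero deltas leave the anchor unchanged.
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))
# Shift the center right and up, and double the width.
moved = decode_box(anchor, (0.1, -0.05, math.log(2.0), 0.0))
```

With these hypothetical deltas the center moves from (150, 200) to (160, 190) and the width grows from 100 to 200, so `moved` is approximately (60, 90, 260, 290).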

class RegionProposalNetwork(torch.nn.Module):
	......
    def forward(self,
                images,       # type: ImageList
                features,     # type: Dict[str, Tensor]
                targets=None  # type: Optional[List[Dict[str, Tensor]]]
                ):
        # type: (...) -> Tuple[List[Tensor], Dict[str, Tensor]]
        """
        Arguments:
            images (ImageList): images for which we want to compute the predictions
            features (OrderedDict[Tensor]): features computed from the images that are
                used for computing the predictions. Each tensor in the list
                correspond to different feature levels
            targets (List[Dict[Tensor]]): ground-truth boxes present in the image (optional).
                If provided, each element in the dict should contain a field `boxes`,
                with the locations of the ground-truth boxes.

        Returns:
            boxes (List[Tensor]): the predicted boxes from the RPN, one Tensor per
                image.
            losses (Dict[Tensor]): the losses for the model during training. During
                testing, it is an empty dict.
        """
        # RPN uses all feature maps that are available
        features = list(features.values())
        objectness, pred_bbox_deltas = self.head(features)
        anchors = self.anchor_generator(images, features)

self.head is the head of the RPN; its implementation is:

class RPNHead(nn.Module):
    """
    Adds a simple RPN Head with classification and regression heads

    Arguments:
        in_channels (int): number of channels of the input feature
        num_anchors (int): number of anchors to be predicted
    """

    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )

        for layer in self.children():
            torch.nn.init.normal_(layer.weight, std=0.01)
            torch.nn.init.constant_(layer.bias, 0)
            
    def forward(self, x):
        # type: (List[Tensor]) -> Tuple[List[Tensor], List[Tensor]]
        logits = []
        bbox_reg = []
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
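The set of trainable parameters in this head can be counted directly from the three Conv2d definitions. The sketch below assumes 512 input channels and 9 anchors (matching the conv3/512 figure) and a hypothetical 38x50 feature map; `conv2d_param_count` and `rpn_head_summary` are illustrative helpers, not torchvision functions.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def rpn_head_summary(in_channels, num_anchors, batch_size, height, width):
    """Parameter counts and output shapes of the three RPNHead layers."""
    params = {
        "conv": conv2d_param_count(in_channels, in_channels, 3),
        "cls_logits": conv2d_param_count(in_channels, num_anchors, 1),
        "bbox_pred": conv2d_param_count(in_channels, num_anchors * 4, 1),
    }
    # 3x3 with padding=1 and 1x1 convolutions at stride 1 keep H and W unchanged.
    shapes = {
        "objectness": (batch_size, num_anchors, height, width),
        "pred_bbox_deltas": (batch_size, num_anchors * 4, height, width),
    }
    return params, shapes


params, shapes = rpn_head_summary(512, 9, 1, 38, 50)
total_params = sum(params.values())
```

Note that `cls_logits` here has `num_anchors` output channels, one logit per anchor as in the sigmoid variant, rather than the 2 x 9 = 18 channels of the softmax variant in the figure; `bbox_pred` has 4 x 9 = 36 channels in both.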

How Anchor Boxes Take Part in the RPN

Anchor boxes contribute to training the RPN mainly through the following objective function (ridge regression):
$w_{\star} = \operatorname*{argmin}_{\hat{w}_{\star}} \sum_i^n \left(t_{\star}^i - \hat{w}_{\star}^T \phi_5(P^i)\right)^2 + \lambda \|\hat{w}_{\star}\|^2$
Here P denotes the anchors, and $t_{\star}$ is computed from the training pairs (P, G):
$\begin{aligned} t_x &= (G_x - P_x)/P_w\\ t_y &= (G_y - P_y)/P_h\\ t_w &= \log{(G_w/P_w)}\\ t_h &= \log{(G_h/P_h)} \end{aligned}$
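The target computation can be sketched in plain Python as follows. The helper names (`to_center`, `encode_targets`, `apply_targets`) and the example boxes are hypothetical; the formulas are the four equations above and their inverse.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def encode_targets(anchor, gt):
    """Regression targets (t_x, t_y, t_w, t_h) for one anchor P and ground truth G."""
    px, py, pw, ph = to_center(anchor)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw,
            (gy - py) / ph,
            math.log(gw / pw),
            math.log(gh / ph))


def apply_targets(anchor, t):
    """Inverse of encode_targets; returns the box in center form."""
    px, py, pw, ph = to_center(anchor)
    tx, ty, tw, th = t
    return (tx * pw + px, ty * ph + py, pw * math.exp(tw), ph * math.exp(th))


anchor = (50.0, 240.0, 200.0, 370.0)      # hypothetical anchor P
gt = (48.0, 240.0, 195.0, 371.0)          # ground-truth box G from the VOC example
targets = encode_targets(anchor, gt)
recovered = apply_targets(anchor, targets)  # should match to_center(gt)
```

For an anchor that already overlaps the object well, all four targets are small numbers close to zero, which is what makes them easy to regress.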
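As a standard least-squares problem, this is something neural networks are good at, so the network can be trained by learning and then used to predict object regions.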
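As a toy illustration of the ridge objective, consider the simplest possible case where the feature $\phi_5(P^i)$ is a single number, so the weight is a scalar with a closed-form solution. The data and function names below are hypothetical; a real network minimizes this kind of objective iteratively by backpropagation over high-dimensional features.

```python
def ridge_objective(w, phis, targets, lam):
    """Sum of squared errors plus the L2 penalty lam * w^2."""
    return sum((t - w * p) ** 2 for p, t in zip(phis, targets)) + lam * w * w


def ridge_fit_1d(phis, targets, lam):
    """Closed-form minimizer for a scalar weight.

    Setting the derivative to zero gives
    w = sum(phi_i * t_i) / (sum(phi_i^2) + lam).
    """
    num = sum(p * t for p, t in zip(phis, targets))
    den = sum(p * p for p in phis) + lam
    return num / den


phis = [1.0, 2.0, 3.0]       # hypothetical scalar features, one per anchor
targets = [0.1, 0.2, 0.3]    # hypothetical t_x values for the same anchors

w_plain = ridge_fit_1d(phis, targets, lam=0.0)   # ordinary least squares
w_ridge = ridge_fit_1d(phis, targets, lam=1.0)   # shrunk toward zero
```

The penalty term pulls the weight toward zero, so `w_ridge` is smaller in magnitude than `w_plain`.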