Faster-RCNN Code Reading Notes (Part 1)

Tianchao龙虾

Article link: https://blog.csdn.net/wuchaohuo724/article/details/119925181

Code: https://github.com/chenyuntc/simple-faster-rcnn-pytorch

[Figure: Faster-RCNN network structure]

The network structure consists of three parts:

Backbone: VGG16
Region Proposal Network
Classification and Regression

1. Backbone

Faster-RCNN uses VGG16 as its backbone. The code is as follows:

decom_vgg16 code

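import torch
from torch import nn
from torchvision.models import vgg16

from utils.config import opt  # the repo's global config object


def decom_vgg16():
    # the 30th layer of feature is relu of conv5_3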
    if opt.caffe_pretrained:
        model = vgg16(pretrained=False)
        if not opt.load_path:
            model.load_state_dict(torch.load(opt.caffe_pretrain_path))
    else:
        model = vgg16(not opt.load_path)

    features = list(model.features)[:30]
    classifier = model.classifier

    classifier  = list(classifier)
    del classifier[6]  # drop the final 1000-way ImageNet Linear layer
    if not opt.use_drop:
        del classifier[5]  # drop the second Dropout
        del classifier[2]  # drop the first Dropout
    
    classifier = nn.Sequential(*classifier)

    # freeze top4 conv
    for layer in features[:10]:
        for p in layer.parameters():
            p.requires_grad = False
    
    return nn.Sequential(*features), classifier

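Printing the model after these modifications gives the structure below. Note that features[:30] stops at index 29, so the final MaxPool2d at index 30 and the avgpool layer are not part of what decom_vgg16 returns:

decom_vgg16 network structure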

(features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
  )
)

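With VGG-16 as the backbone, this part is fairly simple. The author removes the Dropout layers (unless opt.use_drop is set) and freezes the first four convolutional layers (features[:10]), which are not trained.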
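As a quick check on the slice features[:30], the following sketch counts how many 2x2 max-pooling layers survive the cut and derives the resulting downsampling factor. The pool indices are read off the printed structure above; the helper name feature_stride is hypothetical and not part of the repo.

```python
# Indices of the MaxPool2d layers in VGG16 `features`, read off the printout above.
POOL_INDICES = [4, 9, 16, 23, 30]


def feature_stride(n_layers_kept, pool_indices=POOL_INDICES, pool_stride=2):
    """Downsampling factor of the first `n_layers_kept` layers (all convs use stride 1)."""
    stride = 1
    for idx in pool_indices:
        if idx < n_layers_kept:
            stride *= pool_stride
    return stride


print(feature_stride(30))  # 16: features[:30] keeps four pooling layers
print(feature_stride(31))  # 32: the full VGG16 features would halve the map once more
```

This factor of 16 is the feat_stride that the RPN code relies on later.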

2. Region Proposal Network

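Next comes the Region Proposal Network (RPN). Its job is to propose regions of interest (RoIs): the features output by the backbone are fed into the RPN, which performs a coarse regression and classification. It has two branches:

Regression
Classification

While the RPN is itself trained through its classification and regression outputs, it also supplies RoIs to the Fast R-CNN head (RoIHead) as training samples. The RoIs are generated as follows:

For each image (the paper mentions a typical size of 1000x600), use its feature map to compute, for $(H/16) \times (W/16) \times 9$ anchors (roughly 20000, e.g. $40 \times 60 \times 9 = 21600$), the probability of being foreground and the corresponding location parameters.
Select the 12000 anchors with the highest probability.
Use the regressed location parameters to refine the positions of these 12000 anchors, which gives the RoIs.
Apply NMS and keep the 2000 RoIs with the highest probability.

Note: at inference time, to speed up processing, 12000 and 2000 become 6000 and 300 respectively.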
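The anchor count quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the feature map size is the image size floor-divided by the stride (consistent with ceil_mode=False in the pooling layers); count_anchors is a hypothetical helper name.

```python
def count_anchors(img_h, img_w, feat_stride=16, n_anchor=9):
    """Number of anchors for an image, assuming feature size = floor(image size / stride)."""
    feat_h = img_h // feat_stride
    feat_w = img_w // feat_stride
    return feat_h * feat_w * n_anchor


print(count_anchors(600, 1000))  # 37 * 62 * 9 = 20646
print(count_anchors(640, 960))   # 40 * 60 * 9 = 21600
```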

RPN output: RoIs (a tensor of shape 2000×4 or 300×4).

[Figure: RPN structure]

RegionProposalNetwork code

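import numpy as np
import torch.nn.functional as F
from torch import nn


class RegionProposalNetwork(nn.Module):
    def __init__(
    # feat_stride=16: the features come after 4 pooling layers, so the feature map is 16x smaller than the input image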
        self, in_channels=512, mid_channels=512, ratios=[0.5, 1, 2], 
        anchor_scales=[8, 16, 32], feat_stride=16, 
        proposal_creator_params=dict(),
    ):

        super(RegionProposalNetwork, self).__init__()

        self.anchor_base = generate_anchor_base(
            anchor_scales=anchor_scales, ratios=ratios
        )

        self.feat_stride = feat_stride
        self.proposal_layer = ProposalCreator(self, **proposal_creator_params)
        n_anchor = self.anchor_base.shape[0]

        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1)
        # 2k scores: object / not object for each of the k anchors
        self.score = nn.Conv2d(mid_channels, n_anchor*2, kernel_size=1, stride=1, padding=0)
        # 4k values: four box offsets for each of the k anchors
        self.loc   = nn.Conv2d(mid_channels, n_anchor*4, kernel_size=1, stride=1, padding=0)

        normal_init(self.conv1, 0, 0.01)
        normal_init(self.score, 0, 0.01)
        normal_init(self.loc,   0, 0.01)

    
    def forward(self, x, img_size, scale=1.):
        n, _, hh, ww = x.shape
        anchor = _enumerate_shifted_anchor(np.array(self.anchor_base), self.feat_stride, hh, ww) # (anchor_base.shape[0] * hh * ww, 4), roughly 20000 anchors

        n_anchor = anchor.shape[0] // (hh * ww) # 9
        h = F.relu(self.conv1(x))  # 3x3 convolution

        rpn_locs = self.loc(h) # regression, shape [1, 36, hh, ww], e.g. [1, 36, 40, 60]

        rpn_locs = rpn_locs.permute(0, 2, 3, 1).contiguous().view(n, -1, 4) # [1, 21600, 4]
        rpn_scores = self.score(h) # is object or not 
        rpn_scores = rpn_scores.permute(0, 2, 3, 1).contiguous() 
        rpn_softmax_scores = F.softmax(rpn_scores.view(n, hh, ww, n_anchor, 2), dim=4) # reshape and softmax
        rpn_fg_scores = rpn_softmax_scores[:, :, :, :, 1].contiguous() 
        rpn_fg_scores = rpn_fg_scores.view(n, -1) # reshape
        rpn_scores = rpn_scores.view(n, -1, 2)

        rois = list()
        roi_indices = list()
        for i in range(n):
            # Refine the ~20000 anchors with the predicted offsets so that they move toward the ground truth,
            # then use the predicted foreground scores and NMS to select 2000 training RoIs.
            roi = self.proposal_layer(                       # proposal
                rpn_locs[i].cpu().data.numpy(),
                rpn_fg_scores[i].cpu().data.numpy(),
                anchor,
                img_size,
                scale=scale
            )
            batch_index = i * np.ones((len(roi), ), dtype=np.int32)
            rois.append(roi)
            roi_indices.append(batch_index)

        rois = np.concatenate(rois, axis=0)
        roi_indices = np.concatenate(roi_indices, axis=0)
        return rpn_locs, rpn_scores, rois, roi_indices, anchor



def normal_init(m, mean, stddev, truncated=False):
    if truncated:
        m.weight.data.normal_().fmod_(2).mul_(stddev).add_(mean) 
    else:
        m.weight.data.normal_(mean, stddev)
        m.bias.data.zero_()

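The code shows that, for the regression branch, the 1x1 convolution outputs n_anchor*4 channels: the number of anchors times the four coordinates of each anchor. Each point of the feature map has 9 anchors, so the regression branch outputs 9*4 = 36 channels.

For the classification branch, the RPN only decides whether each anchor contains an object, which is the "is object or not" output in the figure. This is a binary classification for each of the 9 anchors, so the output has n_anchor*2 = 9*2 = 18 channels.

The two raw outputs therefore have shapes $(1, 36, hh, ww)$ and $(1, 18, hh, ww)$. Note that apart from the channel count, the spatial size equals that of the input feature map. After a series of reshape operations and a softmax, the results go to the proposal layer, which produces the RoIs. The RPN finally returns rpn_locs, rpn_scores, rois, roi_indices and anchor: the location predictions and score predictions for the roughly 20000 anchors, the 2000 RoIs, the batch index of each RoI, and the coordinates of all anchors. Not all 2000 RoIs are used for training: ProposalTargetCreator (covered later) selects 128 of them to train VGG16RoIHead. How the RPN itself is trained is discussed further below.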
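To make the reshape and softmax steps concrete, here is a small pure-Python sketch of two facts the forward pass relies on: the foreground score is a softmax over the two logits of one anchor, and after permute(0, 2, 3, 1) followed by view(n, -1, 4) the row of anchor a at feature cell (y, x) is (y * ww + x) * 9 + a. The function names are hypothetical; the index formula assumes the 36 channels are laid out anchor by anchor, which is how the view interprets them.

```python
import math


def fg_probability(bg_logit, fg_logit):
    """Softmax over the two logits of one anchor; returns the foreground probability."""
    m = max(bg_logit, fg_logit)  # subtract the max for numerical stability
    e_bg = math.exp(bg_logit - m)
    e_fg = math.exp(fg_logit - m)
    return e_fg / (e_bg + e_fg)


def flat_anchor_index(y, x, a, ww, n_anchor=9):
    """Row of anchor `a` at feature cell (y, x) in the flattened (hh * ww * n_anchor, 4) layout."""
    return (y * ww + x) * n_anchor + a


print(fg_probability(0.0, 0.0))               # 0.5: equal logits give equal probabilities
print(flat_anchor_index(0, 1, 0, ww=60))      # 9: second cell of the first row
print(flat_anchor_index(39, 59, 8, ww=60))    # 21599: last of the 21600 rows
```

The same cell-major, anchor-minor ordering is what _enumerate_shifted_anchor produces, so prediction row i and anchor row i refer to the same anchor.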

Four parts of the code above deserve a closer look:

The ProposalCreator class
The generate_anchor_base function
The _enumerate_shifted_anchor function
The AnchorTargetCreator class

(1) ProposalCreator

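Purpose: provide 2000 training samples for Fast-RCNN, i.e. the detection network.

Input: the loc and score outputs of the RPN's 1*1 convolutions, the coordinates of the ~20000 anchors, the image size, and scale (the factor by which this training image was resized from its original size).

Output: 2000 training RoIs (only a 2000*4 array of coordinates, with no ground truth attached).

The figure above shows the same structure: the RPN takes the image_info, combines it with the outputs of the two branches, and finally outputs the RoIs. This is exactly what ProposalCreator does in the code. How does it work in detail?

ProposalCreator code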

import numpy as np
import torch
from torchvision.ops import nms


class ProposalCreator:
    def __init__(self, parent_model, nms_thresh=0.7, n_train_pre_nms=12000, 
                 n_train_post_nms=2000, n_test_pre_nms=6000, n_test_post_nms=300,
                 min_size=16):
        self.parent_model = parent_model
        self.nms_thresh = nms_thresh
        self.n_train_pre_nms = n_train_pre_nms
        self.n_train_post_nms = n_train_post_nms
        self.n_test_pre_nms = n_test_pre_nms
        self.n_test_post_nms = n_test_post_nms
        self.min_size = min_size
    
    def __call__(self, loc, score, anchor, img_size, scale=1.):
        if self.parent_model.training:
            n_pre_nms = self.n_train_pre_nms # 12000
            n_post_nms = self.n_train_post_nms # 2000 remain after NMS
        else:
            n_pre_nms = self.n_test_pre_nms # 6000
            n_post_nms = self.n_test_post_nms # 300 remain after NMS
        
        roi = loc2bbox(anchor, loc) # apply the predicted offsets to the anchors to get the RoIs, which approximate the ground truth

        # clip ymin, ymax of the RoIs to [0, H]
        roi[:, slice(0, 4, 2)] = np.clip(roi[:, slice(0, 4, 2)], 0, img_size[0])
        # clip xmin, xmax of the RoIs to [0, W]
        roi[:, slice(1, 4, 2)] = np.clip(roi[:, slice(1, 4, 2)], 0, img_size[1])

        min_size = self.min_size * scale # 16 * scale
        hs = roi[:, 2] - roi[:, 0] # heights of the RoIs
        ws = roi[:, 3] - roi[:, 1] # widths of the RoIs

        keep = np.where((hs >= min_size) & (ws >= min_size)) [0] # keep RoIs whose height and width reach the minimum size
        roi = roi[keep, :] 
        # scores of the remaining RoIs (the foreground probabilities predicted by the RPN)
        score = score[keep]

        # flatten the scores and sort them in descending order
        order = score.ravel().argsort()[::-1]
        if n_pre_nms > 0:
            # keep the top 12000 of the ~20000 RoIs when training, the top 6000 when testing
            order = order[:n_pre_nms]
        roi = roi[order, :]
        score = score[order]

        # run non-maximum suppression and return the surviving RoIs: at most 2000 boxes when training, at most 300 when testing
        keep = nms(torch.from_numpy(roi).cuda(), 
                   torch.from_numpy(score).cuda(),
                   self.nms_thresh)
        if n_post_nms > 0:
            keep = keep[:n_post_nms]
        roi = roi[keep.cpu().numpy()]
        return roi

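In the code above, the ProposalCreator class defines __call__, so that an instance can be called like a function.

The class then calls loc2bbox, which uses the RPN's predicted loc values to refine the ~20000 anchors.

Here is the code:

loc2bbox code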

```python
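import numpy as np


def loc2bbox(src_bbox, loc):
    # src_bbox: (R, 4) anchors as (y_min, x_min, y_max, x_max); loc: (R, 4) offsets as (dy, dx, dh, dw)
    if src_bbox.shape[0] == 0:
        return np.zeros((0, 4), dtype=loc.dtype)
    
    src_bbox = src_bbox.astype(src_bbox.dtype, copy=False)

    # heights, widths and centers of the source boxes
    src_height = src_bbox[:, 2] - src_bbox[:, 0]
    src_width = src_bbox[:, 3] - src_bbox[:, 1]
    src_ctr_y = src_bbox[:, 0] + 0.5 * src_height
    src_ctr_x = src_bbox[:, 1] + 0.5 * src_width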

    dy = loc[:, 0::4]
    dx = loc[:, 1::4]
    dh = loc[:, 2::4]
    dw = loc[:, 3::4]

    ctr_y = dy * src_height[:, np.newaxis] + src_ctr_y[:, np.newaxis]
    ctr_x = dx * src_width[:, np.newaxis] + src_ctr_x[:, np.newaxis]

    h = np.exp(dh) * src_height[:, np.newaxis]
    w = np.exp(dw) * src_width[:, np.newaxis]

    dst_bbox = np.zeros(loc.shape, dtype=loc.dtype)
    dst_bbox[:, 0::4] = ctr_y - 0.5 * h
    dst_bbox[:, 1::4] = ctr_x - 0.5 * w
    dst_bbox[:, 2::4] = ctr_y + 0.5 * h
    dst_bbox[:, 3::4] = ctr_x + 0.5 * w

    return dst_bbox

```

loc2bbox takes the anchor boxes and the loc output of the regression branch, and combines the anchors with the predicted offsets. The R-CNN paper describes it as follows:
After learning these functions, we can transform an input proposal P into a predicted ground-truth box $\hat{G}$ by applying the transformation

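The corresponding formulas are:
$\hat{G}_x = P_w d_x(P) + P_x$
$\hat{G}_y = P_h d_y(P) + P_y$
$\hat{G}_w = P_w \exp(d_w(P))$
$\hat{G}_h = P_h \exp(d_h(P))$

Here $P$ is the anchor with center $(P_x, P_y)$ and size $(P_w, P_h)$, and $d_x, d_y, d_w, d_h$ are the predicted offsets (dx, dy, dw, dh in the code). From these formulas we obtain the adjusted anchor boxes $\hat{G}$.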
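A single-box version of these formulas makes the effect of each offset easy to see. The sketch below uses only the standard library; decode_box is a hypothetical name, and the box layout (y_min, x_min, y_max, x_max) with offsets (dy, dx, dh, dw) follows the loc2bbox code above.

```python
import math


def decode_box(anchor, loc):
    """Apply offsets (dy, dx, dh, dw) to one anchor (y_min, x_min, y_max, x_max)."""
    y_min, x_min, y_max, x_max = anchor
    dy, dx, dh, dw = loc
    h, w = y_max - y_min, x_max - x_min
    ctr_y, ctr_x = y_min + 0.5 * h, x_min + 0.5 * w
    new_ctr_y = dy * h + ctr_y        # G_y = P_h * d_y + P_y
    new_ctr_x = dx * w + ctr_x        # G_x = P_w * d_x + P_x
    new_h = math.exp(dh) * h          # G_h = P_h * exp(d_h)
    new_w = math.exp(dw) * w          # G_w = P_w * exp(d_w)
    return (new_ctr_y - 0.5 * new_h, new_ctr_x - 0.5 * new_w,
            new_ctr_y + 0.5 * new_h, new_ctr_x + 0.5 * new_w)


anchor = (0.0, 0.0, 100.0, 200.0)
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))          # zero offsets leave the anchor unchanged
print(decode_box(anchor, (0.1, 0.0, 0.0, 0.0)))          # center moves down by 0.1 * height
print(decode_box(anchor, (0.0, 0.0, math.log(2), 0.0)))  # height doubles around the same center
```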

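The ~20000 refined anchors are now called RoIs. They are first clipped to the image boundaries. Next, the indices of all RoIs whose height and width are both at least min_size (16 times scale) are recorded; suppose 18000 RoIs qualify. These are sorted by predicted score from high to low and only the top 12000 are kept. Finally NMS reduces them to 2000 RoIs. These are the steps ProposalCreator performs after loc2bbox returns.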
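The NMS step can be illustrated with a minimal pure-Python sketch. The repo calls a library nms on GPU tensors; the helpers iou and greedy_nms below are hypothetical names and only show the greedy rule: visit boxes in descending score order and drop any box whose IoU with an already kept box exceeds the threshold.

```python
def iou(a, b):
    """IoU of two boxes in (y_min, x_min, y_max, x_max) format."""
    ih = min(a[2], b[2]) - max(a[0], b[0])
    iw = min(a[3], b[3]) - max(a[1], b[1])
    if ih <= 0 or iw <= 0:
        return 0.0
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, thresh=0.7):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep box i only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, thresh=0.5))  # [0, 2]: box 1 overlaps box 0 with IoU 81/119
print(greedy_nms(boxes, scores, thresh=0.7))  # [0, 1, 2]: 81/119 is below 0.7
```

This loop is quadratic in the number of boxes, which is one reason only the top 12000 (or 6000) boxes are passed to NMS.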

(2) The generate_anchor_base function

generate_anchor_base generates the 9 base anchors. They are called base anchors because 9 anchors of these shapes are generated around every point of the feature map.

generate_anchor_base code

import numpy as np


def generate_anchor_base(base_size=16, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32]):
    py = base_size / 2.
    px = base_size / 2.

    anchor_base = np.zeros((len(ratios) * len(anchor_scales), 4), dtype=np.float32) # one row (y_min, x_min, y_max, x_max) per base anchor, initialized to (0, 0, 0, 0)

    for i in range(len(ratios)):
        for j in range(len(anchor_scales)):
            h = base_size * anchor_scales[j] * np.sqrt(ratios[i])
            w = base_size * anchor_scales[j] * np.sqrt(1. / ratios[i])

            index = i * len(anchor_scales) + j
            anchor_base[index, 0] = py - h / 2.
            anchor_base[index, 1] = px - w / 2.
            anchor_base[index, 2] = py + h / 2.
            anchor_base[index, 3] = px + w / 2.
    return anchor_base

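As the code shows, the function produces 9 anchors centered on (base_size/2, base_size/2) = (8, 8), i.e. the center of the top-left cell of the feature map expressed in image pixels. The anchors combine 3 ratios and 3 scales, [0.5, 1, 2] and [8, 16, 32]; base_size is a configurable parameter, 16 by default. Testing the function gives: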

anchor_base =  [[ -37.254833  -82.50967    53.254833   98.50967 ]
                [ -82.50967  -173.01933    98.50967   189.01933 ]
                [-173.01933  -354.03867   189.01933   370.03867 ]
                [ -56.        -56.         72.         72.      ]
                [-120.       -120.        136.        136.      ]
                [-248.       -248.        264.        264.      ]
                [ -82.50967   -37.254833   98.50967    53.254833]
                [-173.01933   -82.50967   189.01933    98.50967 ]
                [-354.03867  -173.01933   370.03867   189.01933 ]]

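The result is a two-dimensional array of shape (9, 4): the four coordinates (y_min, x_min, y_max, x_max) of each of the 9 anchors of different sizes, as the figure below shows:

[Figure: the 9 base anchors]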
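The first row of the printed array can be checked by hand. The sketch below recomputes the height and width of the anchor with ratio 0.5 and scale 8 from the same formulas as generate_anchor_base; anchor_hw is a hypothetical helper name.

```python
import math


def anchor_hw(base_size, scale, ratio):
    """Height and width of one base anchor; ratio is height / width."""
    h = base_size * scale * math.sqrt(ratio)
    w = base_size * scale * math.sqrt(1.0 / ratio)
    return h, w


h, w = anchor_hw(16, 8, 0.5)
print(round(h, 2), round(w, 2))       # 90.51 181.02
print(round(h * w), round(h / w, 2))  # 16384 0.5
```

The height 90.51 equals 53.254833 - (-37.254833) and the width 181.02 equals 98.50967 - (-82.50967) in the first row. The ratio changes the shape while the area stays at (base_size * scale)^2.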

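Given the anchor_base generated at the top-left corner of the feature map, how are the corresponding 9 anchors generated for every point of the whole feature map?

(3) The _enumerate_shifted_anchor function

The RegionProposalNetwork code contains the following two lines: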

n, _, hh, ww = x.shape
anchor = _enumerate_shifted_anchor(np.array(self.anchor_base), self.feat_stride, hh, ww)

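x is the feature map, so hh and ww are its height and width. The call to _enumerate_shifted_anchor extends the anchor_base described above to the whole feature map (in fact to the whole input image, as explained below). Here is the function:

_enumerate_shifted_anchor code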

# Use the base anchors to generate the anchors for every feature-map position
def _enumerate_shifted_anchor(anchor_base, feat_stride, height, width):
    # vertical offsets (0, 16, 32, ...)
    shift_y = np.arange(0, height * feat_stride, feat_stride) 
    # horizontal offsets (0, 16, 32, ...)
    shift_x = np.arange(0, width * feat_stride, feat_stride)
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shift = np.stack((shift_y.ravel(), shift_x.ravel(),
                        shift_y.ravel(), shift_x.ravel()), axis=1)
    
    A = anchor_base.shape[0] # 9
    K = shift.shape[0]  # K = hh*ww
    anchor = anchor_base.reshape((1, A, 4)) + shift.reshape((1, K, 4)).transpose((1, 0, 2))
    anchor = anchor.reshape((K * A, 4)).astype(np.float32)
    return anchor # shape (K * A, 4): the coordinates of all anchors

Testing the function with a feature map of size $40 \times 60$ and feat_stride 16 gives:

proposal_layer = _enumerate_shifted_anchor(np.array(anchor_base), feat_stride=16, height=40, width=60)

proposal_layer = [[ -37.254833  -82.50967    53.254833   98.50967 ]
                  [ -82.50967  -173.01933    98.50967   189.01933 ]
                  [-173.01933  -354.03867   189.01933   370.03867 ]
                  ...
                  [ 541.49036   906.7452    722.50964   997.2548  ]
                  [ 450.98065   861.49036   813.01935  1042.5096  ]
                  [ 269.96133   770.98065   994.0387   1133.0193  ]]

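The function takes the anchor_base matrix, feat_stride (the downsampling factor of the feature map relative to the input image, 16 here), and the height hh and width ww of the feature map. It starts by generating the vertical and horizontal offsets, which map every point of the feature map back to the input image by multiplying its index by 16. This is why the final anchors live in the coordinate system of the input image, while anchor_base only describes the 9 shapes around the first feature-map cell. The anchors are also much larger than the $40 \times 60$ feature map: the 9 size combinations are chosen relative to the input image and cover most objects in it. If an image contains exceptionally large objects, the scales should be increased accordingly. Note that anchors are not generated at every pixel of the input image but only every 16 pixels, because the feature map is downsampled by a factor of 16, so neighboring feature-map points correspond to image positions 16 pixels apart.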
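The offsets and the last row of the printed result can be verified without NumPy. This is a minimal sketch; enumerate_shifts is a hypothetical helper that lists the (y, x) offsets in the same row-major order as the meshgrid and ravel calls above, and the base anchor values are copied from the printed anchor_base.

```python
def enumerate_shifts(height, width, feat_stride=16):
    """(y, x) image-space offsets of every feature-map cell, row by row."""
    return [(y * feat_stride, x * feat_stride)
            for y in range(height) for x in range(width)]


shifts = enumerate_shifts(40, 60)
print(len(shifts), len(shifts) * 9)      # 2400 21600
print(shifts[0], shifts[1], shifts[-1])  # (0, 0) (0, 16) (624, 944)

# Shifting the last base anchor by the last offset reproduces the final printed row.
base = (-354.03867, -173.01933, 370.03867, 189.01933)
dy, dx = shifts[-1]
last = (base[0] + dy, base[1] + dx, base[2] + dy, base[3] + dx)
print([round(v, 2) for v in last])       # [269.96, 770.98, 994.04, 1133.02]
```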

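(4) The AnchorTargetCreator class

Purpose: assign ground truth to the anchors using the ground-truth bboxes of each image.

Input: the ~20000 anchor coordinates generated initially, and the true coordinates of all bboxes in the image.

Output: labels of size (20000, 1) (at most 128 are 1, the rest of the 256 samples are 0, and all others are -1) and regression targets of size (20000, 4) (one row per anchor).

As mentioned above, about 20000 anchors are generated per image. The RPN has three jobs: classification, regression, and providing RoIs. Where does the ground truth for classification and regression come from? How are the 20000 anchors given positive and negative labels gt_rpn_label for classification, and regression targets gt_rpn_loc for regression? That is the role of this class: it uses the ground-truth bboxes of each image to assign ground truth to the anchors.

Although all 20000 anchors receive a ground-truth entry, only 256 randomly sampled anchors are used for training: up to 128 positives, with negatives filling the remainder. Not all samples are used because negatives vastly outnumber positives in an image. The regression loss is likewise computed on sampled anchors only.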
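The labeling and subsampling rule can be sketched in pure Python. This is an illustration under simplifying assumptions, not the repo's implementation: label_anchors is a hypothetical function that receives the best IoU of each anchor (max_ious) and the indices of the best anchor for each ground-truth box (best_for_gt) as plain lists, and it uses the thresholds 0.7 and 0.3 from the AnchorTargetCreator defaults.

```python
import random


def label_anchors(max_ious, best_for_gt, pos_thresh=0.7, neg_thresh=0.3,
                  n_sample=256, pos_ratio=0.5, seed=0):
    """Label anchors 1 (positive), 0 (negative) or -1 (ignored), then subsample."""
    rng = random.Random(seed)
    label = [-1] * len(max_ious)
    for i, v in enumerate(max_ious):   # negatives first, so positives can overwrite them
        if v < neg_thresh:
            label[i] = 0
    for i in best_for_gt:              # the best anchor for each gt box is positive
        label[i] = 1
    for i, v in enumerate(max_ious):   # anchors above the positive threshold
        if v >= pos_thresh:
            label[i] = 1

    n_pos = int(pos_ratio * n_sample)  # at most 128 positives
    pos = [i for i, l in enumerate(label) if l == 1]
    for i in rng.sample(pos, max(len(pos) - n_pos, 0)):
        label[i] = -1
    n_neg = n_sample - sum(1 for l in label if l == 1)  # negatives fill the rest
    neg = [i for i, l in enumerate(label) if l == 0]
    for i in rng.sample(neg, max(len(neg) - n_neg, 0)):
        label[i] = -1
    return label


max_ious = [0.05] * 1000 + [0.5] * 10 + [0.8] * 3
label = label_anchors(max_ious, best_for_gt=[1012])
print(label.count(1), label.count(0), label.count(-1))  # 3 253 757
```

With only 3 positives available, 253 negatives are kept so that the sample still contains 256 anchors; anchors with an IoU between the two thresholds stay at -1.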

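The class first selects, from the 20000 anchors of an image, those that lie completely inside the image (say 15000 of them) and records their indices. bbox_iou then computes the IoU between these 15000 anchors and the ground-truth bboxes. _create_label finds, for each anchor, the bbox with the highest IoU and that maximum value, and returns the index array argmax_ious (for each anchor, the index of the ground-truth bbox with the highest IoU) together with label (-1 for ignored anchors, 0 for negative samples, 1 for positive samples). Although only 256 anchors are sampled, the returned label still covers all anchors; it simply contains at most 128 ones, enough zeros to bring the total to 256, and -1 everywhere else. bbox2loc then uses argmax_ious to compute the regression targets loc. Finally, using the recorded indices, the 15000 entries are mapped back to arrays of length 20000 (the remaining labels are set to -1 and the remaining locs to (0, 0, 0, 0)). The two 1*1 convolutions of the RPN provide the predicted class scores and location parameters loc, and AnchorTargetCreator provides the corresponding ground truth, which gives the RPN loss rpn_loss: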

$L(\{p_i\},\{t_i\})=\frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\frac{1}{N_{reg}}\sum_i p^*_i L_{reg}(t_i, t_i^*)$

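Here $i$ is the index of an anchor and $p_i$ is the predicted probability that anchor $i$ contains an object. $p_i^*$ is the ground-truth label: 1 if anchor $i$ is positive, 0 if it is negative. $t_i$ is a vector holding the predicted bounding-box parameters, and $t_i^*$ holds the ground-truth parameters of the bounding box associated with a positive anchor. $L_{cls}$ is the log loss over the two classes, and $L_{reg}(t_i, t_i^*)=R(t_i - t_i^*)$, where $R$ is the smooth L1 loss. The factor $p_i^*$ means that the regression term only counts for positive anchors.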
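The loss can be written out for a handful of anchors in plain Python. The sketch below is illustrative only and makes two assumptions that are not taken from the text: both terms are normalized by the number of sampled anchors (those with label 0 or 1), and the smooth L1 transition point beta is 1. The names smooth_l1 and rpn_loss are hypothetical here.

```python
import math


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def rpn_loss(probs, labels, locs, gt_locs, lam=1.0):
    """probs: predicted foreground probabilities; labels: 1, 0 or -1 (ignored)."""
    sampled = [i for i, l in enumerate(labels) if l >= 0]
    cls = 0.0
    for i in sampled:
        # log loss: probability assigned to the true class of anchor i
        p = probs[i] if labels[i] == 1 else 1.0 - probs[i]
        cls -= math.log(p)
    reg = 0.0
    for i in sampled:
        if labels[i] == 1:  # p_i^* = 1: only positives contribute to regression
            reg += sum(smooth_l1(t - g) for t, g in zip(locs[i], gt_locs[i]))
    n = max(len(sampled), 1)  # assumption: one shared normalizer for both terms
    return cls / n + lam * reg / n


probs = [0.9, 0.2, 0.5]
labels = [1, 0, -1]
locs = [(0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
gt_locs = [(0.0, 0.0, 0.0, 0.0)] * 3
print(rpn_loss(probs, labels, locs, gt_locs))
```

The third anchor has label -1 and affects neither term, and the second anchor is negative, so it only enters the classification term.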

AnchorTargetCreator code

class AnchorTargetCreator(object):
    def __init__(self,
                 n_sample=256,
                 pos_iou_thresh=0.7, neg_iou_thresh=0.3,
                 pos_ratio=0.5):
        self.n_sample = n_sample
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh = neg_iou_thresh
        self.pos_ratio = pos_ratio
    def __call__(self, bbox, anchor, img_size): # anchor: (S, 4), S is the number of anchors

        img_H, img_W = img_size

        n_anchor = len(anchor) # usually about 20000 anchors
        # drop the anchors that extend beyond the image; keep only the indices of anchors inside it
        inside_index = _get_inside_index(anchor, img_H, img_W)
        # keep the anchors that lie inside the image
        anchor = anchor[inside_index]
        # label the anchors: up to 128 positives, negatives filling the rest of the 256 samples
        argmax_ious, label = self._create_label(
            inside_index, anchor, bbox)

        # compute bounding box regression targets
        # for every anchor inside the image, compute the offsets to the bbox with which it has the highest IoU
        loc = bbox2loc(anchor, bbox[argmax_ious])

        # map up to original set of anchors
        # map the labels of the inside anchors back to all ~20000 generated anchors
        label = _unmap(label, n_anchor, inside_index, fill=-1)
        # map the regression targets of the inside anchors back to all ~20000 generated anchors
        loc = _unmap(loc, n_anchor, inside_index, fill=0)

        return loc, label

    def _create_label(self, inside_index, anchor, bbox):
        # label: 1 is positive, 0 is negative, -1 is dont care
        # inside_index holds the indices of all anchors inside the image
        label = np.empty((len(inside_index),), dtype=np.int32)
        # fill everything with -1
        label.fill(-1)

        argmax_ious, max_ious, gt_argmax_ious = \
            self._calc_ious(anchor, bbox, inside_index)

        # assign negative labels first so that positive labels can clobber them
        # anchors whose highest IoU is below neg_iou_thresh (0.3) get label 0
        label[max_ious < self.neg_iou_thresh] = 0

        # positive label: for each gt, anchor with highest iou
        # the anchor with the highest IoU for each bbox gets label 1
        label[gt_argmax_ious] = 1

        # positive label: above threshold IOU
        # anchors whose highest IoU reaches pos_iou_thresh (0.7) get label 1
        label[max_ious >= self.pos_iou_thresh] = 1

        # subsample positive labels if we have too many
        # number of positives from the ratio: pos_ratio=0.5, n_sample=256
        n_pos = int(self.pos_ratio * self.n_sample)
        pos_index = np.where(label == 1)[0] # indices of all positive samples
        # if there are more positives than n_pos, randomly discard the surplus by setting their label to -1
        if len(pos_index) > n_pos:
            disable_index = np.random.choice(
                pos_index, size=(len(pos_index) - n_pos), replace=False)
            label[disable_index] = -1

        # subsample negative labels if we have too many
        n_neg = self.n_sample - np.sum(label == 1)
        neg_index = np.where(label == 0)[0] # indices of all negative samples
        
        if len(neg_index) > n_neg:
            # randomly discard len(neg_index) - n_neg negatives by setting their label to -1
            disable_index = np.random.choice(
                neg_index, size=(len(neg_index) - n_neg), replace=False)
            label[disable_index] = -1

        return argmax_ious, label

    def _calc_ious(self, anchor, bbox, inside_index):
        # ious between the anchors and the gt boxes
        # bbox_iou computes the IoU of every anchor with every bbox; ious has shape (N, K): N anchors (about 15000), K bboxes
        ious = bbox_iou(anchor, bbox)
        # axis=1 reduces over the bboxes (one result per anchor); axis=0 reduces over the anchors (one result per bbox)
        argmax_ious = ious.argmax(axis=1) 
        # for each anchor: the bbox with the highest IoU and that IoU value; max_ious has N entries
        max_ious = ious[np.arange(len(inside_index)), argmax_ious]
        gt_argmax_ious = ious.argmax(axis=0)
        # for each bbox: the anchor with the highest IoU and that IoU value; gt_max_ious has K entries
        gt_max_ious = ious[gt_argmax_ious, np.arange(ious.shape[1])]
        # indices of all anchors that attain the highest IoU for some bbox (at least K, more if there are ties)
        gt_argmax_ious = np.where(ious == gt_max_ious)[0]

        return argmax_ious, max_ious, gt_argmax_ious
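bbox2loc is called above but its code is not shown. Conceptually it is the inverse of loc2bbox: it computes the offsets that would move an anchor onto its matched ground-truth box. The following single-box sketch is an assumption-based illustration, not the repo's code: encode_box and decode_box are hypothetical names, and the boxes are assumed to have strictly positive height and width.

```python
import math


def encode_box(src, dst):
    """Offsets (dy, dx, dh, dw) that move box `src` onto box `dst`."""
    src_h, src_w = src[2] - src[0], src[3] - src[1]
    dst_h, dst_w = dst[2] - dst[0], dst[3] - dst[1]
    src_cy, src_cx = src[0] + 0.5 * src_h, src[1] + 0.5 * src_w
    dst_cy, dst_cx = dst[0] + 0.5 * dst_h, dst[1] + 0.5 * dst_w
    return ((dst_cy - src_cy) / src_h, (dst_cx - src_cx) / src_w,
            math.log(dst_h / src_h), math.log(dst_w / src_w))


def decode_box(src, loc):
    """Inverse of encode_box, following the loc2bbox formulas."""
    h, w = src[2] - src[0], src[3] - src[1]
    cy = loc[0] * h + src[0] + 0.5 * h
    cx = loc[1] * w + src[1] + 0.5 * w
    nh, nw = math.exp(loc[2]) * h, math.exp(loc[3]) * w
    return (cy - 0.5 * nh, cx - 0.5 * nw, cy + 0.5 * nh, cx + 0.5 * nw)


anchor = (0.0, 0.0, 100.0, 200.0)
gt = (10.0, 20.0, 110.0, 120.0)
loc = encode_box(anchor, gt)
print([round(v, 3) for v in loc])                      # [0.1, -0.15, 0.0, -0.693]
print([round(v, 6) for v in decode_box(anchor, loc)])  # recovers the gt box
```

These offsets are the regression targets $t_i^*$ that the RPN's loc predictions are trained to match.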

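Conclusion

This part walked through the Backbone and RPN code of Faster-RCNN. Most of the complexity lies in understanding the RPN. The RPN receives the feature map from the backbone and extracts RoIs from it. An RoI is simply a refined anchor. The anchors come from two functions, generate_anchor_base and _enumerate_shifted_anchor, and the refinement parameters come from the two output branches of the RPN. ProposalCreator refines the anchors and finally passes 2000 RoIs to the head network. Training the RPN requires ground truth for the anchors, which the AnchorTargetCreator class generates. In practice the RPN is not trained on all 20000 anchors: 256 anchors are sampled, with at most 128 positives and negatives making up the rest.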
