用PyTorch从零开始实现Faster RCNN

最新推荐文章于 2024-08-09 07:26:31 发布

AiFool

最新推荐文章于 2024-08-09 07:26:31 发布

阅读量3.4k

点赞数

计算机视觉专栏收录该内容

4 篇文章 0 订阅

订阅专栏

作者：Prakashjay. 贡献： Suraj Amonkar, Sachin Chandra, Rajneesh Kumar 和 Vikash Challa.

多谢您的阅读，学习愉快。

原标题：Guide to build Faster RCNN in PyTorch

作者 | Machine-Vision Research Group

原文链接：https://medium.com/@fractaldle/guide-to-build-faster-rcnn-in-pytorch-95b10c273439

引言

Faster R-CNN是首次完全采用Deep Learning的学习框架之一。Faster R-CNN是基于Fast RCNN的思路，然而Fast RCNN却继承自RCNN，SPP-Net的思路（译者注：此处理清楚先后关系）。虽然我们在构建Faster RCNN框架时引入了一些Fast RCNN的思想，但是我们不会详细讨论这些框架。其中一个原因是，Faster R-CNN表现得非常好，它没有使用传统的计算机视觉技术，如选择性搜索等。在非常高的层次上，Fast RCNN和Faster RCNN的工作原理如下面的流程图所示。

Fast RCNN和Faster RCNN

我们写过一篇关于目标检测框架的详细的博客，可以作为独自编码理解Faster RCNN的指导。

上图可以看到唯一的区别是Faster RCNN中将selective search替换为RPN(Region Proposal Network)，selective search算法采用SIFT和HOG描述子来生成目标候选，在CPU上2秒/张图像。这一过程代价高，Fast RCNN在一张图像上总共耗费2.3秒产生预测，Faster RCNN速度为5 FPS（每秒的帧数），即使在后端使用非常深入的图像分类器，如VGGnet(现在也使用ResNet和ResNext)。

因此，为了从零开始构建Faster RCNN，需要明确理解以下4个主题（流程）：

Region Proposal network (RPN)
RPN loss functions
Region of Interest Pooling (ROI)
ROI loss functions
RPN还引入了一个新的概念：Anchor boxes，这成为构建目标检测框架的一个黄金准则。下面我们深入理解目标检测框架的不同步骤在Faster RCNN中如何发挥作用。

在训练Faster RCNN时通常的数据流如下：

从图像中提取特征；
产生anchor目标；
RPN网络中得到位置和目标预测分值；
取前N个坐标及其目标得分即建议层；
传递前N个坐标通过Fast R-CNN网络，生成4中建议的每个位置的位置和cls预测；
对4中建议的每个坐标生成建议目标；
采用2,3计算rpn_cls_loss和rpn_reg_loss；
采用5,6计算roi_cls_loss和roi_reg_loss；
配置VGG16作为实验后期的网络，可以用相似的方式采用其他任意的分类网络。

特征提取

我们从一张图像和一组边界框开始，其标签定义如下：

import torch
image = torch.zeros((1, 3, 800, 800)).float()
bbox = torch.FloatTensor([[20, 30, 400, 500], [300, 400, 500, 600]])

[y1, x1, y2, x2] format

labels = torch.LongTensor([6, 8]) # 0 represents background
sub_sample = 16

VGG16网络作为特征提取模块，这是RPN和Fast RCNN的支柱，为此需要对VGG16网络进行修改，网络输入为800，特征提取模块的输出的特征图尺寸为（800//16），因此需要保证VGG16模块可以得到这样的特征图储存并且将网络修剪整齐，可以通过如下方式实现：

创建一个dummy image，并将volatile设置为False;
列出VGG16所有的层；
当图像(feature map)的output_size低于所需的级别(800//16)时，将图像传递到各个层并对列表取子集；
将list转换为Sequential module;
来看看每个步骤：

生成一个dummy image并且设置volatile为False：

import torchvision
dummy_img = torch.zeros((1, 3, 800, 800)).float()
print(dummy_img)
#Out: torch.Size([1, 3, 800, 800])
2. 列出VGG16的所有层：

model = torchvision.models.vgg16(pretrained=True)
fe = list(model.features)
print(fe) # length is 15

[Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False),

Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False),

Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False),

Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False),

Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),

ReLU(inplace),

MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False)]

将图像传输通过所有层，确定得到相应的尺寸：

req_features = []
k = dummy_img.clone()
for i in fe:
k = i(k)
if k.size()[2] < 800//16:
break
fee.append(i)
out_channels = k.size()[1]
print(len(req_features)) #30
print(out_channels) # 512
4. 将list转换为Sequential module：

faster_rcnn_fe_extractor = nn.Sequential(*req_features)
现在faster_rcnn_fe_extractor可以作为后端，计算特征：

out_map = faster_rcnn_fe_extractor(image)
print(out_map.size())
#Out: torch.Size([1, 512, 50, 50])
Anchor boxes

这是我们第一次遇到anchor boxes。详细理解将使我们能够非常容易地理解目标检测。所以让我们详细谈谈这是如何做到的。

在一个feature map的坐标上生成Anchor；
在所有feature map的坐标上生成Anchor；
对每个目标分配标签及坐标（相对于anchor）；
在feature map坐标生成Anchor；
将采用anchor_scales=8，16，32，ratio=0.5，1，2，sub sampling=16（因为我们将图像从800像素池化至50像素）。输出的feature map的每个像素对应图像中的16*16像素，如下图所示：

image to feature map mapping

我们需要首先在这个16 * 16像素上生成锚框，然后沿着x轴和y轴进行类似的操作，以获得所有的anchor boxes。这在步骤2中完成。
feature map的每个像素位置生成9个anchor boxes（anchor_scales的数量和ratio的数量），每个anchor box具有‘y1’,‘x1’,‘y2’,‘x2’。因此每个位置anchor会具有形状（9,4）。开始为一个空的全0的数组。
import numpy as np
ratio = [0.5, 1, 2]
anchor_scales = [8, 16, 32]
anchor_base = np.zeros((len(ratios) * len(scales), 4), dtype=np.float32)
print(anchor_base)
#Out:

array([[0., 0., 0., 0.],

[0., 0., 0., 0.],

[0., 0., 0., 0.]], dtype=float32)

让我们用相应的y1、x1、y2、x2填充这些值。我们的这个基础anchor的中心将在：

ctr_y = sub_sample / 2.
ctr_x = sub_sample / 2.
print(ctr_y, ctr_x)

Out: (8, 8)

for i in range(len(ratios)):
for j in range(len(anchor_scales)):
h = sub_sample * anchor_scales[j] * np.sqrt(ratios[i])
w = sub_sample * anchor_scales[j] * np.sqrt(1./ ratios[i])
index = i * len(anchor_scales) + j
anchor_base[index, 0] = ctr_y - h / 2.
anchor_base[index, 1] = ctr_x - w / 2.
anchor_base[index, 2] = ctr_y + h / 2.
anchor_base[index, 3] = ctr_x + w / 2.
#Out:

array([[ -37.254833, -82.50967 , 53.254833, 98.50967 ],

[ -82.50967 , -173.01933 , 98.50967 , 189.01933 ],

[-173.01933 , -354.03867 , 189.01933 , 370.03867 ],

[ -56. , -56. , 72. , 72. ],

[-120. , -120. , 136. , 136. ],

[-248. , -248. , 264. , 264. ],

[ -82.50967 , -37.254833, 98.50967 , 53.254833],

[-173.01933 , -82.50967 , 189.01933 , 98.50967 ],

[-354.03867 , -173.01933 , 370.03867 , 189.01933 ]],

dtype=float32)

这些是第一个feature map像素的Anchor位置，我们现在必须在feature map的所有位置生成这些anchor。还要注意，negitive值表示anchor boxes在图像维度之外。在后面的部分中，我们将用-1标记它们，并在计算函数损失和生成Anchor建议时删除它们。而且由于我们在每个位置都有9个Anchor，并且在一个图像中有50 * 50个这样的位置，我们总共会得到17500个(50 * 50 * 9) Anchor。让我们现在生成其他Anchor 。（译者注：此处生成的是备选Anchor，即所有可能的Anchor）

在所有特征图位置生成Anchor

为了实现这一目标，首先需要为每个feature map像素生成中心（复原到原始图像的位置点）：

fe_size = (800//16)
ctr_x = np.arange(16, (fe_size+1) * 16, 16)
ctr_y = np.arange(16, (fe_size+1) * 16, 16)
遍历ctr_x和ctr_y可以得到每个位置的中心，代码如下：

For x in shift_x:
For y in shift_y:
Generate anchors at (x, y) locations
anchor中心可视化如下：

图像中的anchor中心

采用python生成中心：

index = 0
for x in range(len(ctr_x)):
for y in range(len(ctr_y)):
ctr[index, 1] = ctr_x[x] - 8
ctr[index, 0] = ctr_y[y] - 8
index +=1
输出将是每个位置的(x, y)值，如上图所示。我们总共有2500个锚点。现在我们需要在每个中心生成anchor boxes。这可以使用我们在一个位置生成锚的代码来完成，为提供每个anchor中心的代码添加一个提取for循环就可以了。让我们看看这是怎么做的：
anchors = np.zeros((fe_size * fe_size * 9), 4)
index = 0
for c in ctr:
ctr_y, ctr_x = c
for i in range(len(ratios)):
for j in range(len(anchor_scales)):
h = sub_sample * anchor_scales[j] * np.sqrt(ratios[i])
w = sub_sample * anchor_scales[j] * np.sqrt(1./ ratios[i])
anchors[index, 0] = ctr_y - h / 2.
anchors[index, 1] = ctr_x - w / 2.
anchors[index, 2] = ctr_y + h / 2.
anchors[index, 3] = ctr_x + w / 2.
index += 1
print(anchors.shape)
#Out: [22500, 4]
注：为了简单起见，我让这段代码看起来非常冗长。有更好的方法来生成anchor boxes。

这会成为图像最终的anchor用于后续的步骤，下面可视化这些anchor如何在图像上传播。

anchor boxes at (400, 400)：

一张图像中所有有效的anchor boxes示意图

分配目标标签和位置给每个anchor

现在，由于我们已经生成了所有的anchor boxes，我们需要查看图像中的目标，并将它们分配给包含它们的特定anchor boxes。Faster_R-CNN有一些给anchor boxes分配标签的指导原则。

我们给两种anchor分配了正标签：

a)与ground-truth-box重叠度最高的Intersection-over-Union (IoU)的anchor；

b)与ground-truth box 的IoU重叠度大于0.7的anchor。

注意，单个ground-truth对象可以为多个anchor分配正标签。

c)对所有与ground-truth box的IoU比率小于0.3的anchor标记为负标签；

d)anchor既不是正样本的也不是负样本，对训练没有帮助。

让我们看看这是怎么做的。

bbox = np.asarray([[20, 30, 400, 500], [300, 400, 500, 600]],
dtype=np.float32) # [y1, x1, y2, x2] format
labels = np.asarray([6, 8], dtype=np.int8) # 0 represents background
通过如下方式对anchor boxes分配标签和位置：

找到有效的anchor boxes的索引，并且生成索引数组，生成标签数组其形状索引数组填充-1（无效anchor boxes，对应上文说的处在边框外的anchor boxes）。
检查是否满足以上a、b、c条件中的一条，并相应填写标签。如果是正anchor box(标签为1)，注意哪个ground-truth目标可以得到这个结果。
计算与anchor box相关的ground-truth的位置(loc)。
通过为所有无效的anchor box填充-1和为所有有效锚箱计算的值来重新组织所有锚箱。
输出应该是(N, 1)数组的标签和带有(N, 4)数组的locs。
找到所有有效anchor boxes的索引：
index_inside = np.where(
(anchors[:, 0] >= 0) &
(anchors[:, 1] >= 0) &
(anchors[:, 2] <= 800) &
(anchors[:, 3] <= 800)
)[0]
print(index_inside.shape)
#Out: (8940,)
生成空的标签数组，大小为inside_index，填充-1，默认设置为(d)：
label = np.empty((len(inside_index), ), dtype=np.int32)
label.fill(-1)
print(label.shape)
#Out = (8940, )
生成有效anchor boxes数组：
valid_anchor_boxes = anchors[inside_index]
print(valid_anchor_boxes.shape)
#Out = (8940, 4)
对每个有效anchor box计算与每个ground-truth目标的iou。因为我们有8940个anchor boxes和2个ground-truth目标，应该得到（8940,2）的数组作为输出。两个框之间iou计算的代码如下：

Find the max of x1 and y1 in both the boxes (xn1, yn1)
Find the min of x2 and y2 in both the boxes (xn2, yn2)
Now both the boxes are intersecting only
if (xn1 < xn2) and (yn2 < yn1)
iou_area will be (xn2 - xn1) * (yn2 - yn1)
else
iuo_area will be 0
similarly calculate area for anchor box and ground truth object
iou = iou_area/(anchor_box_area + ground_truth_area - iou_area)
计算iou的python代码如下：

ious = np.empty((len(valid_anchors), 2), dtype=np.float32)
ious.fill(0)
print(bbox)
for num1, i in enumerate(valid_anchors):
ya1, xa1, ya2, xa2 = i
anchor_area = (ya2 - ya1) * (xa2 - xa1)
for num2, j in enumerate(bbox):
yb1, xb1, yb2, xb2 = j
box_area = (yb2- yb1) * (xb2 - xb1)
inter_x1 = max([xb1, xa1])
inter_y1 = max([yb1, ya1])
inter_x2 = min([xb2, xa2])
inter_y2 = min([yb2, ya2])
if (inter_x1 < inter_x2) and (inter_y1 < inter_y2):
iter_area = (inter_y2 - inter_y1) *
(inter_x2 - inter_x1)
iou = iter_area /
(anchor_area+ box_area - iter_area)
else:
iou = 0.
ious[num1, num2] = iou
print(ious.shape)
#Out: [22500, 2]
注意：采用numpy数组，可以使计算效率更高并且冗余更少。然而，我们这样做的原因是可以使没有很强的代数知识的人也能理解。

考虑a和b的情况，我们需要找到两件事

每个gt_box及其对应的anchor box的最高iou；
每个anchor box及其对应的ground-truth box的最高iou；
case-1：

gt_argmax_ious = ious.argmax(axis=0)
print(gt_argmax_ious)
gt_max_ious = ious[gt_argmax_ious, np.arange(ious.shape[1])]
print(gt_max_ious)

Out:

[2262 5620]

[0.68130493 0.61035156]

case-2：

argmax_ious = ious.argmax(axis=1)
print(argmax_ious.shape)
print(argmax_ious)
max_ious = ious[np.arange(len(inside_index)), argmax_ious]
print(max_ious)

Out:

(22500,)

[0, 1, 0, …, 1, 0, 0]

[0.06811669 0.07083762 0.07083762 … 0. 0. 0. ]

找到具有max_ious的anchor_boxes（gt_max_ious）：

gt_argmax_ious = np.where(ious == gt_max_ious)[0]
print(gt_argmax_ious)

Out:

[2262, 2508, 5620, 5628, 5636, 5644, 5866, 5874, 5882, 5890, 6112,

6120, 6128, 6136, 6358, 6366, 6374, 6382]

至此有3个数组：

argmax_ious——确定哪个ground-truth目标与每个anchor都有最大的iou；
max_ious——确定ground-truth目标与每个anchor的max_iou；
gt_argmax_ious——确定与ground-truth box有最大的IoU重叠的anchor。
用argmax_ious和max_ious可以为满足[b],[c]的anchor box分配标签和位置，使用gt_argmax_iou，我们可以为anchor box分配标签和位置，以满足[a]。

对一些变量施加阈值：

pos_iou_threshold = 0.7
neg_iou_threshold = 0.3
分配负标签（0）给max_iou小于负阈值[c]的所有anchor boxes：
label[max_ious < neg_iou_threshold] = 0
分配正标签（1）给与ground-truth box[a]的IoU重叠最大的anchor boxes：
label[gt_argmax_ious] = 1
分配正标签（1）给max_iou大于positive阈值[b]的anchor boxes：
label[max_ious >= pos_iou_threshold] = 1
训练RPN，Faster_R-CNN文章描述如下：

每个mini-batch都来自包含许多正样本和负样本anchor的单个图像，但这将偏向负样本，因为它们占主导地位。相反，我们随机采样图像中的256个anchor来计算mini-batch的损失函数，其中被采样的正锚点和负锚点的比例高达1:1。如果一幅图像中有少于128个正样本，我们就用负样本填充mini-batch图像。由此我们可以推出两个变量如下：

pos_ratio = 0.5
n_sample = 256
所有的正样本：

n_pos = pos_ratio * n_sample
现在我们需要从正标签中随机采样n_pos个样本，忽略（-1）剩余的样本。在一些情况下，得到少于n_pos个样本，此时随机采样（n_sample-n_pos）个负样本（0），忽略剩余的anchor boxes，用如下代码实现：

positive samples
pos_index = np.where(label == 1)[0]
if len(pos_index) > n_pos:
disable_index = np.random.choice(pos_index, size=(len(pos_index) - n_pos), replace=False)
label[disable_index] = -1
negative samples
n_neg = n_sample * np.sum(label == 1)
neg_index = np.where(label == 0)[0]
if len(neg_index) > n_neg:
disable_index = np.random.choice(neg_index, size=(len(neg_index) - n_neg), replace = False)
label[disable_index] = -1
anchor boxes 定位

现在让我们用具有最大iou的ground truth对象为每个anchor box分配位置。注意，我们将为所有有效的anchor box分配anchor locs，而不考虑其标签，稍后在计算损失时，我们可以使用简单的过滤器删除它们。

我们已经知道与每个anchor box具有高的iou的ground truth目标，现在我们需要找到ground truth相对于anchor box的坐标。Faster_R-CNN按照如下参数化：

t_{x} = (x - x_{a})/w_{a}
t_{y} = (y - y_{a})/h_{a}
t_{w} = log(w/ w_a)
t_{h} = log(h/ h_a)
x, y , w, h是ground truth box的中心坐标，宽，高。x_a，y_a，h_a，w_a为anchor boxes的中心坐标，宽，高。

对每个anchor box，找到具有max_iou的ground-truth目标：
max_iou_bbox = bbox[argmax_ious]
print(max_iou_bbox)
#Out

[[ 20., 30., 400., 500.],

[ 20., 30., 400., 500.],

…,

[ 20., 30., 400., 500.],

[ 20., 30., 400., 500.]]

为找到t_{x}, t_{y}, t_{w}, t_{h}，需要转换有效的anchor boxes的y1,x1,y2,x2格式和与ground truth boxes相关的ctr_y，ctr_x，h，w格式：
height = valid_anchors[:, 2] - valid_anchors[:, 0]
width = valid_anchors[:, 3] - valid_anchors[:, 1]
ctr_y = valid_anchors[:, 0] + 0.5 * height
ctr_x = valid_anchors[:, 1] + 0.5 * width
base_height = max_iou_bbox[:, 2] - max_iou_bbox[:, 0]
base_width = max_iou_bbox[:, 3] - max_iou_bbox[:, 1]
base_ctr_y = max_iou_bbox[:, 0] + 0.5 * base_height
base_ctr_x = max_iou_bbox[:, 1] + 0.5 * base_width
用上述公式找到位置：
eps = np.finfo(height.dtype).eps
height = np.maximum(height, eps)
width = np.maximum(width, eps)
dy = (base_ctr_y - ctr_y) / height
dx = (base_ctr_x - ctr_x) / width
dh = np.log(base_height / height)
dw = np.log(base_width / width)
anchor_locs = np.vstack((dy, dx, dh, dw)).transpose()
print(anchor_locs)
#Out:

[[ 0.5855727 2.3091455 0.7415673 1.647276 ]

[ 0.49718437 2.3091455 0.7415673 1.647276 ]

[ 0.40879607 2.3091455 0.7415673 1.647276 ]

…

[-2.50802 -5.292254 0.7415677 1.6472763 ]

[-2.5964084 -5.292254 0.7415677 1.6472763 ]

[-2.6847968 -5.292254 0.7415677 1.6472763 ]]

现在得到anchor_locs和与每个有效的anchor boxes相关的标签
用inside_index变量将他们映射到原始的anchors，无效的anchor box标签填充-1（忽略），位置填充0。

最终的标签：
anchor_labels = np.empty((len(anchors),), dtype=label.dtype)
anchor_labels.fill(-1)
anchor_labels[inside_index] = label
最终的坐标：

anchor_locations = np.empty((len(anchors),) + anchors.shape[1:], dtype=anchor_locs.dtype)
anchor_locations.fill(0)
anchor_locations[inside_index, :] = anchor_locs
最终的两个矩阵为：

anchor_locations [N, 4] — [22500, 4]
anchor_labels [N,] — [22500]
这用于RPN网络的输入目标，下节可以看到RPN网络如何设计。

Region Proposal Network（RPN网络）

正如之前讨论，在先前的研究中，通过selective search，CPMC, MCG, EdgeBoxes等方法生成网络的region proposals。Faster R-CNN是第一个采用深度学习生成region proposals的方法。

RPN Network

网络包括一个卷积模块，在卷积模块下一层包括一个回归层，预测anchor中的box的位置。

为生成region proposals，在特征提取模块得到的卷积特征层输出上滑动一个小的网络。这个小网络将输入卷积特征层的n*n空间窗口作为输入，每个滑动窗口映射到更低维的特征[512维特征]。这个特征输入到两个并列的全连接层：

框回归层
框分类层
正如Faster R-CNN文章中所属，采用n=3，采用nn的卷积层和两个并列的11卷积层实现这一框架：

import torch.nn as nn
mid_channels = 512
in_channels = 512 # depends on the output feature map. in vgg 16 it is equal to 512
n_anchor = 9 # Number of anchors at each location
conv1 = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)
reg_layer = nn.Conv2d(mid_channels, n_anchor *4, 1, 1, 0)
cls_layer = nn.Conv2d(mid_channels, n_anchor *2, 1, 1, 0) ## I will
be going to use softmax here. you can equally use sigmoid if u
replace 2 with 1.
文章中，采用均值0，标准差0.01初始化权重，偏差初始化为0，通过如下代码实现：

conv sliding layer

conv1.weight.data.normal_(0, 0.01)
conv1.bias.data.zero_()

Regression layer

reg_layer.weight.data.normal_(0, 0.01)
reg_layer.bias.data.zero_()

classification layer

cls_layer.weight.data.normal_(0, 0.01)
cls_layer.bias.data.zero_()
特征提取过程的输出可以输入到网络中用于预测目标相对anchor的位置和与之相关的目标分值：

x = conv1(out_map) # out_map is obtained in section 1
pred_anchor_locs = reg_layer(x)
pred_cls_scores = cls_layer(x)
print(pred_cls_scores.shape, pred_anchor_locs.shape)
#Out:
#torch.Size([1, 18, 50, 50]) torch.Size([1, 36, 50, 50])
让我们重新格式化一下，让它与我们之前设计的锚点目标对齐。我们还将找到每个anchor box的目标得分用于proposal层，我们将在下一节中讨论：

pred_anchor_locs = pred_anchor_locs.permute(0, 2, 3, 1).contiguous().view(1, -1, 4)
print(pred_anchor_locs.shape)
#Out: torch.Size([1, 22500, 4])
pred_cls_scores = pred_cls_scores.permute(0, 2, 3, 1).contiguous()
print(pred_cls_scores)
#Out torch.Size([1, 50, 50, 18])
objectness_score = pred_cls_scores.view(1, 50, 50, 9, 2)[:, :, :, :, 1].contiguous().view(1, -1)
print(objectness_score.shape)
#Out torch.Size([1, 22500])
pred_cls_scores = pred_cls_scores.view(1, -1, 2)
print(pred_cls_scores.shape)

Out torch.size([1, 22500, 2])

本节实现以下步骤：

pred_cls_scores和pred_anchor_locs是RPN网络的输出，损失更新权重。
pred_cls_scores和objectness_scores作为proposal层的输入，proposal层生成后续用于RoI网络的一系列proposal，将在下一节中实现。
Generating proposals to feed Fast R-CNN network

proposal函数将采用如下参数：

模式：training_mode还是testing_mode。
nms_thresh。
n_train_pre_nms——训练时nms之前的bbox的数目。
n_train_post_nms——训练时nms之后的bbox的数目。
n_test_pre_nms——测试时nms之前的bbox的数目。
n_test_post_nms——测试时nms之后的bbox的数目。
min_size——生成一个proposal的所需的目标的最小高度。
Faster R-CNN中RPN的proposals彼此之间重叠程度高。为了减少冗余，我们根据proposal区域的cls分数对其进行非最大抑制(non-maximum supression, NMS)。我们将NMS的IoU阈值设置为0.7，这样每个图像就有大约2000个建议区域。经过简化研究，作者表明，NMS不损害最终检测的准确性，但大大减少了建议的数量。在NMS之后，我们使用top-N的建议区域进行检测。后续使用2000个RPN proposals训练Fast R-CNN。在测试期间，他们只评估了300个proposal，他们用不同的数字测试，并得到了如下结果：

nms_thresh = 0.7
n_train_pre_nms = 12000
n_train_post_nms = 2000
n_test_pre_nms = 6000
n_test_post_nms = 300
min_size = 16
我们需要做以下的事情产生网络感兴趣的proposal region。

转换RPN网络的loc预测为bbox[y1,x1,y2,x2]格式。
将预测框剪辑到图像上。
去除高度或宽度
通过分数从高到低排序所有的(proposal, score)对。
取前几个预测框pre_nms_topN(如，训练时12000，测试时300)
采用nms threshold > 0.7。
取前几个预测框pos_nms_topN(如，训练时2000，测试时300)
我们将在本节剩余部分中查看每个阶段：

转换RPN网络的loc预测为bbox[y1,x1,y2,x2]格式。

这是为anchor boxes设置ground truth时的逆操作，该操作通过无参数化及相对图像的偏差来解码预测。公式如下：

x = (w_{a} * ctr_x_{p}) + ctr_x_{a}
y = (h_{a} * ctr_x_{p}) + ctr_x_{a}
h = np.exp(h_{p}) * h_{a}
w = np.exp(w_{p}) * w_{a}
and later convert to y1, x1, y2, x2 format
转换anchor格式从 y1, x1, y2, x2 到 ctr_x, ctr_y, h, w ：
anc_height = anchors[:, 2] - anchors[:, 0]
anc_width = anchors[:, 3] - anchors[:, 1]
anc_ctr_y = anchors[:, 0] + 0.5 * anc_height
anc_ctr_x = anchors[:, 1] + 0.5 * anc_width
通过上述公式转换预测locs，在转换之前先将pred_anchor_loc和objectness_score为numpy数组：
pred_anchor_locs_numpy = pred_anchor_locs[0].data.numpy()
objectness_score_numpy = objectness_score[0].data.numpy()
dy = pred_anchor_locs_numpy[:, 0::4]
dx = pred_anchor_locs_numpy[:, 1::4]
dh = pred_anchor_locs_numpy[: 2::4]
dw = pred_anchor_locs_numpy[: 3::4]
ctr_y = dy * anc_height[:, np.newaxis] + anc_ctr_y[:, np.newaxis]
ctr_x = dx * anc_width[:, np.newaxis] + anc_ctr_x[:, np.newaxis]
h = np.exp(dh) * anc_height[:, np.newaxis]
w = np.exp(dw) * anc_width[:, np.newaxis]
转换 [ctr_x, ctr_y, h, w]为[y1, x1, y2, x2]格式：
roi = np.zeros(pred_anchor_locs_numpy.shape, dtype=loc.dtype)
roi[:, 0::4] = ctr_y - 0.5 * h
roi[:, 1::4] = ctr_x - 0.5 * w
roi[:, 2::4] = ctr_y + 0.5 * h
roi[:, 3::4] = ctr_x + 0.5 * w
#Out:

[[ -36.897102, -80.29519 , 54.09939 , 100.40507 ],

[ -83.12463 , -165.74298 , 98.67854 , 188.6116 ],

[-170.7821 , -378.22214 , 196.20844 , 349.81198 ],

…,

[ 696.17816 , 747.13306 , 883.4582 , 836.77747 ],

[ 621.42114 , 703.0614 , 973.04626 , 885.31226 ],

[ 432.86267 , 622.48926 , 1146.7059 , 982.9209 ]]

剪辑预测框到图像上：

img_size = (800, 800) #Image size
roi[:, slice(0, 4, 2)] = np.clip(
roi[:, slice(0, 4, 2)], 0, img_size[0])
roi[:, slice(1, 4, 2)] = np.clip(
roi[:, slice(1, 4, 2)], 0, img_size[1])
print(roi)
#Out:

[[ 0. , 0. , 54.09939, 100.40507],

[ 0. , 0. , 98.67854, 188.6116 ],

[ 0. , 0. , 196.20844, 349.81198],

…,

[696.17816, 747.13306, 800. , 800. ],

[621.42114, 703.0614 , 800. , 800. ],

[432.86267, 622.48926, 800. , 800. ]]

去除高度或宽度 < threshold的预测框：

hs = roi[:, 2] - roi[:, 0]
ws = roi[:, 3] - roi[:, 1]
keep = np.where((hs >= min_size) & (ws >= min_size))[0]
roi = roi[keep, :]
score = objectness_score_numpy[keep]
print(score.shape)
#Out:
##(22500, ) all the boxes have minimum size of 16
4. 按分数从高到低排序所有的（proposal, score）对：

order = score.ravel().argsort()[::-1]
print(order)
#Out:
#[ 889, 929, 1316, …, 462, 454, 4]
5. 取前几个预测框pre_nms_topN(如训练时12000，测试时300)：

order = order[:n_train_pre_nms]
roi = roi[order, :]
print(roi.shape)
print(roi)
#Out

(12000, 4)

[[607.93866, 0. , 800. , 113.38187],

[ 0. , 0. , 235.29704, 369.64795],

[572.177 , 0. , 800. , 373.0086 ],

…,

[250.07968, 186.61633, 434.6356 , 276.70615],

[490.07974, 154.6163 , 674.6356 , 244.70615],

[266.07968, 602.61633, 450.6356 , 692.7062 ]]

采用nms threshold > 0.7：

第一个问题，什么是非极大抑制？这是一个去除/合并具有极高重叠的边界框的过程。下图中有很多重叠的边界框，想保留一些重叠不多的独特的边界框。阈值设置为0.7，定义去除/合并重叠边界框所需的最小重叠区域面积。

NMS的代码如下：

Take all the roi boxes [roi_array]
Find the areas of all the boxes [roi_area]
Take the indexes of order the probability score in descending order [order_array]
keep = []
while order_array.size > 0:
take the first element in order_array and append that to keep
Find the area with all other boxes
Find the index of all the boxes which have high overlap with this box
Remove them from order array
Iterate this till we get the order_size to zero (while loop)
Ouput the keep variable which tells what indexes to consider.
参考文献

https://arxiv.org/abs/1506.01497
https://github.com/chenyuntc/simple-faster-rcnn-pytorch