Faster R-CNN

羊肉串串魅力无穷

Article link: https://blog.csdn.net/lk3030/article/details/84919522

Paper information

Paper: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Related papers in the series: link

Code implementation: link

Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

Faster R-CNN framework

Faster R-CNN integrates feature extraction, region proposal, bounding box regression, and classification into a single network. This improves overall performance considerably, most noticeably in detection speed.

Faster R-CNN consists of 4 main parts:

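Conv layers

As a CNN-based object detection method, Faster R-CNN first uses a stack of basic conv + relu + pooling layers to extract feature maps from the image. These feature maps are shared by the subsequent RPN and fully connected layers.

Region Proposal Networks

The RPN generates region proposals. It uses softmax to classify anchors as foreground or background, and uses bounding box regression to refine the anchors into accurate proposals.

Roi Pooling

The Roi Pooling layer takes the feature maps and the proposals as input, combines them to extract proposal feature maps, and passes these to the subsequent fully connected layers for classification.

Classification

The proposal feature maps are used to compute the class of each proposal, and bounding box regression is applied once more to obtain the final, precise position of each detection box.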

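The Faster R-CNN algorithm involves three scales:

Original image scale: the size of the raw input. It is unrestricted and does not affect performance.
Normalized scale: the size of the image fed into the feature extraction network, set at test time (opts.test_scale = 600 in the source code). Anchors are defined at this scale. This parameter, together with the relative size of the anchors, determines the range of object sizes that can be detected.
Network input scale: the size of the input to the feature detection network, set at training time ($224 \times 224$ in the source code).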

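Anchors

Anchors are the candidate boxes generated by rpn/generate_anchors.py in the authors' code. Each anchor has 4 values $(x_1, y_1, x_2, y_2)$ giving the coordinates of its top-left and bottom-right corners.
The authors use 3 areas $\left\{ 128^2, 256^2, 512^2 \right\}$ and 3 aspect ratios $\left\{1:1, 1:2, 2:1 \right\}$, producing $k = 9$ anchors at each sliding position. A feature map of size $W \times H$ therefore has $W \times H \times k$ anchors in total.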
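
The combination of three areas and three aspect ratios can be sketched in a few lines. This is a minimal illustration, not the authors' rpn/generate_anchors.py: the function name `make_base_anchors`, the choice to center the boxes at the origin, and the convention that a ratio means height / width are all assumptions made here, and the real script differs in details such as rounding.

```python
import math

def make_base_anchors(sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(sizes) * len(ratios) anchors as (x1, y1, x2, y2),
    centered at (0, 0). `ratios` is height / width (assumed convention)."""
    anchors = []
    for size in sizes:
        area = float(size * size)
        for ratio in ratios:
            # Keep the area fixed while changing the shape:
            # w * h = area and h / w = ratio.
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

base_anchors = make_base_anchors()
print(len(base_anchors))  # 9
```

Each of the 9 boxes keeps the area of its size class, so only the shape changes within a class.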

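When generating anchors, the center of the $3 \times 3$ convolution kernel serves as the anchor center. As the kernel slides over the feature map, the mapping from feature map to image is used to place anchors of multiple scales and aspect ratios on the normalized image.

An important parameter here is feat_stride = 16: moving one position on the feature map corresponds to moving 16 pixels in the original image.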

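For an $800 \times 600$ image, VGG downsamples by a factor of 16 and each feature map position gets 9 anchors, so:

$\text{number of anchors} = \mathrm{ceil}(800 / 16) \times \mathrm{ceil}(600 / 16) \times 9 = 50 \times 38 \times 9 = 17100$

Here $\mathrm{ceil}()$ denotes rounding up, because the feature map output by VGG has size $50 \times 38$.
Anchors thus introduce the multi-scale approach commonly used in detection.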
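
The arithmetic above can be checked with a short script. This is a sketch under the assumption stated in the text, namely that the feature map size is the image size divided by feat_stride and rounded up; the helper names `count_anchors` and `anchor_shifts` are hypothetical.

```python
import math

def count_anchors(width, height, feat_stride=16, k=9):
    """Feature map size and total anchor count for an image, assuming
    the feature map is ceil(size / feat_stride) along each axis."""
    feat_w = math.ceil(width / feat_stride)
    feat_h = math.ceil(height / feat_stride)
    return feat_w, feat_h, feat_w * feat_h * k

def anchor_shifts(feat_w, feat_h, feat_stride=16):
    """Image-space (x, y) shift for every feature map position; each
    shift would be added to all k base anchors."""
    return [(x * feat_stride, y * feat_stride)
            for y in range(feat_h) for x in range(feat_w)]

feat_w, feat_h, total = count_anchors(800, 600)
print(feat_w, feat_h, total)  # 50 38 17100
```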

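Region Proposal Networks (RPN)

First, after conv5, a $3 \times 3$ convolution (256 channels for ZF, 512 for VGG) is applied. It keeps the number of channels unchanged while letting each point on the feature map incorporate spatial information from its $3 \times 3$ neighborhood.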

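Next, for the $k$ anchors at each point of the feature map ($k = 9$ by default):

Each anchor must be classified as foreground or background, so a $1 \times 1$ convolution maps the feature map channels (256 for ZF, 512 for VGG) to cls = $2k$ scores ($2 \times 9 = 18$ channels). This is the foreground/background classification branch.

Softmax then classifies the anchors as foreground or background (the objects to detect are foreground).
Each anchor has 4 offsets corresponding to $[x, y, w, h]$, so a $1 \times 1$ convolution maps the feature map channels to reg = $4k$ coordinates ($4 \times 9 = 36$ channels). This is the bounding box regression branch.

Computing the bounding box regression offsets for the anchors yields accurate proposals.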
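
To make the channel counts concrete, the sketch below computes $2k$ and $4k$ and applies a two-way softmax to the scores at a single feature map position. The function name `fg_probabilities` and the channel layout (the $k$ background scores first, then the $k$ foreground scores) are assumptions for illustration only.

```python
import math

def fg_probabilities(scores, k=9):
    """Softmax over (background, foreground) for each of the k anchors at
    one feature map position. Assumes `scores` holds k background scores
    followed by k foreground scores (layout chosen for this sketch)."""
    assert len(scores) == 2 * k
    probs = []
    for a in range(k):
        bg, fg = scores[a], scores[k + a]
        m = max(bg, fg)  # subtract the max for numerical stability
        e_bg, e_fg = math.exp(bg - m), math.exp(fg - m)
        probs.append(e_fg / (e_bg + e_fg))
    return probs

k = 9
cls_channels, reg_channels = 2 * k, 4 * k
print(cls_channels, reg_channels)  # 18 36

# Equal background and foreground scores give probability 0.5 per anchor.
probs = fg_probabilities([0.0] * (2 * k), k)
```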

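Finally, the Proposal Layer combines the foreground anchors with the bounding box regression offsets to produce proposals, discarding those that are too small or extend beyond the image boundary.

Because there are too many anchors, during training the authors randomly select 128 positive anchors and 128 negative anchors from the eligible ones.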
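
The sampling step might look like the following sketch. The function name `sample_anchors` and the label encoding (1 positive, 0 negative, -1 ignored) are assumptions made here. When fewer than 128 positives exist, this version pads the batch with extra negatives, which is how the paper describes handling that case.

```python
import random

def sample_anchors(labels, num_pos=128, num_neg=128, seed=None):
    """labels[i] is 1 (positive), 0 (negative) or -1 (ignored).
    Returns the indices of the sampled positives and negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    pos_sample = rng.sample(pos, min(num_pos, len(pos)))
    # Pad with negatives when there are not enough positives.
    neg_quota = num_neg + (num_pos - len(pos_sample))
    neg_sample = rng.sample(neg, min(neg_quota, len(neg)))
    return pos_sample, neg_sample

labels = [1] * 10 + [0] * 1000 + [-1] * 500
pos_idx, neg_idx = sample_anchors(labels, seed=0)
print(len(pos_idx), len(neg_idx))  # 10 246
```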

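Proposal Layer

Starting from the anchors produced by the anchor generation layer, the proposal layer uses NMS (based on the foreground score) to filter out redundant anchors. It also applies the regression coefficients from the RPN to the corresponding anchors to obtain the transformed bounding boxes.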

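The Proposal Layer has 3 inputs, plus the feat_stride parameter:

rpn_cls_prob_reshape: the foreground/background classification result for the anchors.
rpn_bbox_pred: the corresponding bbox regression offsets $[d_{x}(A), d_{y}(A), d_{w}(A), d_{h}(A)]$.
im_info: information about the original image.
An image of arbitrary size $P \times Q$ is first reshaped to a fixed $M \times N$ before being passed to Faster R-CNN; im_info = [M, N, scale_factor] stores all the information about this scaling.
Parameter feat_stride = 16:
After 4 pooling operations in the Conv Layers, the image becomes $W \times H = (M / 16) \times (N / 16)$. feat_stride = 16 records this and is used to compute anchor offsets.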

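The forward pass of the Proposal Layer proceeds in the following order:

Generate the anchors and apply bbox regression to all of them using $[d_{x}(A), d_{y}(A), d_{w}(A), d_{h}(A)]$ (the anchors are generated exactly as in training).
Sort the anchors by foreground softmax score in descending order and keep the top pre_nms_topN (e.g. 6000), i.e. the position-refined foreground anchors.
Clip foreground anchors that extend beyond the image boundary to the boundary (so that proposals do not exceed the image during the later roi pooling).
Remove very small foreground anchors (width < threshold or height < threshold).
Apply non-maximum suppression (NMS).
Sort the remaining foreground anchors by foreground softmax score in descending order again and output the top post_nms_topN (e.g. 300) as proposals (proposal $= [x_1, y_1, x_2, y_2]$).

Because the anchors are mapped back to the original image in step 3 to check whether they exceed the boundary, the output proposals are at the scale of the $M \times N$ input image.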
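
The six steps above can be sketched in plain Python. This is an illustration under several assumptions, not the authors' layer: the helper names are hypothetical, im_info is read as (height, width, scale), the `min_size` default of 16 pixels is a placeholder, and the NMS IoU threshold of 0.7 is the value given in the paper.

```python
import math

def decode(anchor, delta):
    """Apply (dx, dy, dw, dh) to an (x1, y1, x2, y2) anchor."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + wa / 2, anchor[1] + ha / 2
    x, y = xa + delta[0] * wa, ya + delta[1] * ha
    w, h = wa * math.exp(delta[2]), ha * math.exp(delta[3])
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh):
    """Greedy NMS; returns kept indices in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

def proposal_layer(anchors, deltas, fg_scores, im_info, pre_nms_top_n=6000,
                   post_nms_top_n=300, nms_thresh=0.7, min_size=16):
    img_h, img_w = im_info[0], im_info[1]  # assumed (height, width, scale)
    # Step 1: refine every anchor with its regression offsets.
    boxes = [decode(a, d) for a, d in zip(anchors, deltas)]
    # Step 2: keep the pre_nms_top_n highest foreground scores.
    order = sorted(range(len(boxes)), key=lambda i: fg_scores[i], reverse=True)
    order = order[:pre_nms_top_n]
    boxes = [boxes[i] for i in order]
    scores = [fg_scores[i] for i in order]
    # Step 3: clip boxes to the image boundary.
    boxes = [(min(max(b[0], 0), img_w - 1), min(max(b[1], 0), img_h - 1),
              min(max(b[2], 0), img_w - 1), min(max(b[3], 0), img_h - 1))
             for b in boxes]
    # Step 4: drop boxes that are too small.
    big = [i for i, b in enumerate(boxes)
           if b[2] - b[0] >= min_size and b[3] - b[1] >= min_size]
    boxes = [boxes[i] for i in big]
    scores = [scores[i] for i in big]
    # Steps 5 and 6: NMS, then keep the top post_nms_top_n survivors.
    keep = nms(boxes, scores, nms_thresh)[:post_nms_top_n]
    return [boxes[i] for i in keep], [scores[i] for i in keep]

anchors = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300), (0, 0, 4, 4)]
deltas = [(0.0, 0.0, 0.0, 0.0)] * 4
proposals, kept_scores = proposal_layer(anchors, deltas, [0.9, 0.8, 0.7, 0.95],
                                        im_info=(600, 800, 1.0))
print(kept_scores)  # [0.9, 0.7]
```

In this toy input the tiny box is removed by the size filter despite having the highest score, and the second box is suppressed because its IoU with the first is about 0.82.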

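RPN loss function

To train the RPN, the authors assign each anchor a binary class label (foreground or background):

An anchor whose IoU with every ground-truth box is below 0.3 is labeled negative (background).
An anchor is labeled positive (foreground) in either of 2 cases:
(i) it has the highest IoU with a ground-truth box
(ii) its IoU with a ground-truth box exceeds 0.7

A single ground-truth box may assign positive labels to multiple anchors. The second condition is usually sufficient to determine the positive samples; the first is kept because in rare cases the second condition finds no positive sample.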
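
The labeling rules can be sketched as follows. The function name `label_anchors` is hypothetical, boxes are assumed to be (x1, y1, x2, y2) tuples, and anchors that satisfy neither rule are marked -1 and ignored, a convention assumed here. The sketch also assumes at least one ground-truth box.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (ignored) per anchor.
    Assumes gt_boxes is non-empty."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = [-1] * len(anchors)
    for i, row in enumerate(overlaps):
        best = max(row)
        if best < neg_thresh:
            labels[i] = 0      # below 0.3 with every ground-truth box
        elif best > pos_thresh:
            labels[i] = 1      # condition (ii)
    # Condition (i): the best-matching anchor(s) for each ground-truth
    # box are positive, even if their IoU is below pos_thresh.
    for j in range(len(gt_boxes)):
        column = [overlaps[i][j] for i in range(len(anchors))]
        best = max(column)
        if best > 0:
            for i, value in enumerate(column):
                if value == best:
                    labels[i] = 1
    return labels

anchors = [(0, 0, 100, 100), (0, 0, 50, 50), (300, 300, 400, 400), (10, 10, 110, 110)]
gt_boxes = [(0, 0, 100, 100), (290, 290, 330, 330)]
labels = label_anchors(anchors, gt_boxes)
print(labels)  # [1, 0, 1, -1]
```

The third anchor overlaps the second ground-truth box with an IoU of only about 0.08, yet it becomes positive through condition (i) because no other anchor matches that box better.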

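With these definitions, the multi-task loss function is defined as:

$L(\lbrace p_i \rbrace, \lbrace t_i \rbrace) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p^{*}_i) + \lambda\frac{1}{N_{reg}}\sum_i p^{*}_i L_{reg}(t_i, t^{*}_i)$

where:

The outputs of the cls and reg layers consist of $\lbrace p_i \rbrace$ and $\lbrace t_i \rbrace$ respectively.
$i$ is the index of an anchor in a mini-batch.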

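Classification part:

$p_i$ is the predicted probability that anchor $i$ is foreground.
$p_i^{\ast} = 1$ when anchor $i$ is foreground, otherwise $p_i^{\ast} = 0$.
$L_{cls}$ is the log loss for foreground/background classification in the RPN.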

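Regression part:

$t_i$ is a vector of the 4 parameterized coordinates of the predicted box.
$t^{*}_i$ is the corresponding vector for the ground-truth box associated with a foreground anchor.
For box regression, the parameters are the coordinates $[x, y, w, h]$ of each sample, i.e. the box center and its width and height. Three sets of parameters are involved:
- predicted box coordinates $[x, y, w, h]$
- anchor coordinates $[x_a, y_a, w_a, h_a]$
- ground truth coordinates $[x^{\ast},y^{\ast},w^{\ast},h^{\ast}]$
The offset of the predicted box from the anchor center and the scaling of its width and height, $\lbrace t \rbrace$, are computed as: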
$t_{\textrm{x}} = (x - x_{\textrm{a}})/w_{\textrm{a}}$

$t_{\textrm{y}} = (y - y_{\textrm{a}})/h_{\textrm{a}}$

$t_{\textrm{w}} = \log(w / w_{\textrm{a}})$

$t_{\textrm{h}} = \log(h / h_{\textrm{a}})$

The offsets and scaling of the ground truth relative to the anchor, $\lbrace t^{*} \rbrace$:

$t^{*}_{\textrm{x}} = (x^{*} - x_{\textrm{a}})/w_{\textrm{a}}$

$t^{*}_{\textrm{y}} = (y^{*} - y_{\textrm{a}})/h_{\textrm{a}}$

$t^{*}_{\textrm{w}} = \log(w^{*} / w_{\textrm{a}})$

$t^{*}_{\textrm{h}} = \log(h^{*} / h_{\textrm{a}})$

The regression goal is to make $\lbrace t \rbrace$ as close as possible to $\lbrace t^{*} \rbrace$.
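
The parameterization above translates directly into code. In this sketch the function names `encode` and `decode` are hypothetical, and boxes are assumed to be in center format $(x, y, w, h)$ as in the formulas.

```python
import math

def encode(box, anchor):
    """Center-format box (x, y, w, h) -> (tx, ty, tw, th) relative to anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Inverse of encode: recover (x, y, w, h) from the offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)
gt = (110.0, 90.0, 256.0, 64.0)
t_star = encode(gt, anchor)   # regression target for this anchor
recovered = decode(t_star, anchor)
```

Here the target is a shift of 10/128 of the anchor size in each direction, with log-scale factors of $\log 2$ and $\log 0.5$. Decoding the target returns the ground-truth box.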
The loss function for the box regression part of the RPN is:

$L_{reg}(t_i, t^{*}_i)=R(t_i - t^{*}_i)$

where $R$ is the Smooth L1 function used to compute the regression loss:

$\mathrm{Smooth}_{L1}(x)=\begin{cases} 0.5x^2 & \text{if } |x| \leq 1 \\ |x|-0.5 & \text{otherwise}\end{cases}$

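The Smooth L1 loss curve is shown in the figure below. Compared with the L2 loss, Smooth L1 is less sensitive to outliers, which keeps the gradient magnitude bounded and makes training easier to converge.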
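
A small numeric comparison illustrates the point. The L2 loss is written here as $0.5x^2$ so that it matches the quadratic branch of Smooth L1; that scaling and the function names are choices made for this sketch.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| > 1."""
    ax = abs(x)
    return 0.5 * x * x if ax <= 1 else ax - 0.5

def smooth_l1_grad(x):
    """Derivative of smooth_l1; its magnitude never exceeds 1."""
    if abs(x) <= 1:
        return x
    return 1.0 if x > 0 else -1.0

def l2(x):
    """L2 loss scaled to match the quadratic branch above."""
    return 0.5 * x * x

print(smooth_l1(0.5), l2(0.5))    # 0.125 0.125
print(smooth_l1(3.0), l2(3.0))    # 2.5 4.5
print(smooth_l1_grad(10.0))       # 1.0
```

For a small residual the two losses agree. For a large residual the L2 loss and its gradient keep growing, while the Smooth L1 gradient stays at 1.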

$p^{*}_i L_{reg}$ means the regression loss is active only for foreground anchors and is disabled otherwise ($p^{*}_i=0$).

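Normalization:

The classification and regression losses are normalized by $N_{cls}$ and $N_{reg}$ respectively, and weighted by a balancing parameter $\lambda$.
In the authors' current implementation, the cls term is normalized by the mini-batch size ($N_{cls}=256$), and the reg term by the number of anchor locations (i.e. the number of feature map positions, $N_{reg}\sim 2400$).
By default $\lambda=10$, so the cls and reg terms are roughly equally weighted ($1/N_{cls} = 1/256 \approx 0.0039$ versus $\lambda/N_{reg} \approx 10/2400 \approx 0.0042$).
Experiments show that the results are insensitive to $\lambda$ over a wide range. The normalization is also not required and could be simplified.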
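
Putting the pieces together, the multi-task loss can be sketched as below. The function name `rpn_loss`, the list-based inputs, and the small `eps` clamp inside the logarithm are assumptions for illustration. The defaults for `n_cls`, `n_reg` and `lam` are the values quoted in the text.

```python
import math

def smooth_l1(x):
    ax = abs(x)
    return 0.5 * x * x if ax <= 1 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p[i]: predicted foreground probability of anchor i.
    p_star[i]: 1 for foreground, 0 for background.
    t[i], t_star[i]: predicted and target (tx, ty, tw, th)."""
    eps = 1e-12  # avoids log(0); an implementation detail of this sketch
    cls_sum = 0.0
    reg_sum = 0.0
    for pi, ps, ti, ts in zip(p, p_star, t, t_star):
        # Log loss over the two classes.
        prob_of_true_class = pi if ps == 1 else 1.0 - pi
        cls_sum += -math.log(max(prob_of_true_class, eps))
        # The regression term is switched off for background anchors.
        reg_sum += ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
    return cls_sum / n_cls + lam * reg_sum / n_reg

zeros = (0.0, 0.0, 0.0, 0.0)
loss = rpn_loss([0.5], [1], [zeros], [zeros], n_cls=1, n_reg=1)
```

With a single foreground anchor predicted at probability 0.5 and a perfect box, only the classification term remains, so the loss equals $\log 2$.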

Detection results on the PASCAL VOC 2007 test set using different values of $\lambda$: