Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Rongxin_Ma

Tags: cnn, object detection, python, deep learning

Article link: https://blog.csdn.net/m17635840562/article/details/134989937

1. Paper Overview

Venue: NeurIPS 2015
Paper: https://arxiv.org/abs/1506.01497
Code: https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/faster_rcnn
Dataset: https://pjreddie.com/media/files/VOCtrainval_11-May-2012.tar
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with ‘attention’ mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
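Why this paper: The author of this blog mainly works on time-series anomaly detection and is new to object-detection papers in the vision field, so this NIPS 2015 paper on Faster R-CNN was chosen as a first read. Although it was published a long time ago, it is still a widely studied method and holds an important place in image object detection. In that respect it resembles Attention Is All You Need (Transformer), published at NIPS in 2017, which has had a deep impact not only on NLP but also on vision, time series, and recommendation, earning the saying "everything can be a Transformer". A classic endures, so this is the first paper in the author's study of the field.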

2. Proposed Method

Figure: Overall architecture of Faster R-CNN

2.1 Overall Architecture

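The figure above shows the model. Faster R-CNN can be divided into four main parts:

conv layers (convolutional layers): From reading the code, the conv layers are a stack of conv + ReLU + pooling layers (in the author's earlier deep-learning work, the analogous structure was an MLP built from Linear + ReLU, used to learn complex nonlinear relationships). The conv layers extract image features and produce feature maps (capturing things such as colour, edges, and correlations between pixels), which the later layers consume.
Region Proposal Networks (RPN): The RPN learns region proposals and produces a score for each candidate region, used to decide whether the region is positive or negative.
RoI Pooling: Combines the feature maps with the proposals to extract the part of the feature maps that belongs to each proposal.
Classification: Computes the class of each proposal from its proposal feature maps.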
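
To make the RoI Pooling step more concrete, here is a minimal pure-Python sketch. It assumes a single-channel feature map stored as a nested list, a proposal already expressed in feature-map coordinates, and a 2x2 output grid. The function name `roi_max_pool` and the toy values are hypothetical and only for illustration; real implementations work on multi-channel tensors and usually pool to a larger grid.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi=(x1, y1, x2, y2) (inclusive, in feature-map
    coordinates) of a 2-D feature map into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Row range of bin i; max(...) keeps every bin at least one cell tall.
        r0 = y1 + (i * roi_h) // out_h
        r1 = max(y1 + ((i + 1) * roi_h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            # Column range of bin j, at least one cell wide.
            c0 = x1 + (j * roi_w) // out_w
            c1 = max(x1 + ((j + 1) * roi_w) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map whose entries are 0..23, filled row by row.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
pooled = roi_max_pool(fmap, roi=(1, 0, 4, 3))
print(pooled)  # [[8, 10], [20, 22]]
```

Whatever the size of the proposal, the output grid has the same shape, which is what lets the later classification layers accept proposals of arbitrary size.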

2.2 Region Proposal Networks (RPN)

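Classical detection methods are slow at generating candidate boxes. Faster R-CNN generates them directly with the RPN, which is its major advantage and greatly speeds up box generation.
A Region Proposal Network (RPN) takes an image of any size as input and outputs a set of rectangular object proposals, each with an objectness score. This process is modelled with a fully convolutional network whose features feed two sibling layers, a box-regression layer (reg) and a box-classification layer (cls). The ultimate goal is to share computation with the Fast R-CNN object detection network.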

2.2.1 Anchors

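At each sliding-window position, multiple region proposals are predicted simultaneously, at most k per position. After the intermediate layer, the reg layer outputs 4k coordinates (each anchor has four offsets corresponding to (x, y, w, h), so reg = 4·k coordinates), and the cls layer outputs 2k scores (each anchor is classified as positive or negative, so the 256-d feature at each position is mapped to cls = 2·k scores). These scores estimate the probability that each box contains an object or not. Each sliding-window position therefore has k anchors, and a convolutional feature map of size W·H yields WHk anchors in total.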
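
The anchor count can be checked with a short sketch. It assumes the paper's default of 3 scales and 3 aspect ratios (k = 9) and a feature stride of 16; the function name `generate_anchors`, the centre-of-cell mapping back to image coordinates, and the tiny 4x3 feature map are illustrative assumptions, not taken from the linked repository.

```python
import math


def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (cx, cy, w, h) in image coordinates, with
    k = len(scales) * len(ratios) anchors per feature-map position."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this sliding-window position mapped back to the image
            # (assumed mapping: centre of the stride x stride cell).
            cx = x * stride + stride / 2
            cy = y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    # Keep the area at s * s while setting h / w = r.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return anchors


W, H = 4, 3                      # toy feature-map size
k = 9                            # 3 scales x 3 ratios
anchors = generate_anchors(W, H)
print(len(anchors))              # W * H * k = 108
print(4 * k, 2 * k)              # reg and cls outputs per position: 36 18
```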

2.2.2 box-regression layer (reg)

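The sections "Translation-Invariant Anchors" and "Multi-Scale Anchors as Regression References" in the paper are about using regression to refine box coordinates.
[Figure: bounding-box regression example, with the ground-truth box in green and the predicted box in red]
In the figure above, the green box is the ground truth and the red box is a positive object learned by the model. To move the red box closer to the ground-truth box, the authors look for a mapping between the two through translation and scaling, i.e. a linear transformation. (In my view, the relation between the two boxes can be modelled by linear regression only when they are already close, so this step is suited to fine-tuning and not to correcting boxes that differ greatly.) This yields more accurate boxes, and the top-K boxes are passed to the RoI pooling layer.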
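
The translation-and-scaling mapping can be written out as a small sketch. It follows the parameterisation given in the paper, t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a), where (x, y, w, h) is a box centre and size and the subscript a denotes the anchor. The function names `encode` and `decode` and the box values are hypothetical.

```python
import math


def encode(box, anchor):
    """Offsets (tx, ty, tw, th) that map `anchor` onto `box`.
    Both boxes are (cx, cy, w, h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa,         # translation, scaled by anchor width
            (y - ya) / ha,         # translation, scaled by anchor height
            math.log(w / wa),      # log-space width scaling
            math.log(h / ha))      # log-space height scaling


def decode(t, anchor):
    """Apply offsets (tx, ty, tw, th) to `anchor` to obtain a refined box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 128.0, 128.0)   # illustrative anchor
gt = (110.0, 96.0, 140.0, 120.0)        # illustrative ground-truth box
t_star = encode(gt, anchor)             # regression target for this anchor
recovered = decode(t_star, anchor)      # decoding the target gives back gt
```

When the anchor and the ground truth are close, all four targets are small numbers near zero, which fits the observation above that the regression acts as a fine adjustment.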

2.3 Loss Function

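$L(\{p_{i}\},\{t_{i}\}) = \frac{1}{N_{cls}}\sum_{i}L_{cls}(p_{i},p_{i}^{*})+\lambda \frac{1}{N_{reg}}\sum_{i}p_{i}^{*}L_{reg}(t_{i},t_{i}^{*})$
Here p_i is the predicted probability that anchor i is an object and t_i is the vector of the 4 parameterized coordinates of the predicted bounding box; p_i^* and t_i^* are the corresponding ground truth (p_i^* is 1 for a positive anchor and 0 for a negative one, so the regression term counts only for positive anchors).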
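
A minimal sketch of this loss on plain Python lists follows. It assumes, as in the paper, that L_cls is the log loss over object vs. not object and that L_reg is the smooth L1 loss; the paper's default is λ = 10. The function names and the toy inputs are hypothetical, and the normalisers are passed in explicitly and do not use the paper's values.

```python
import math


def smooth_l1(x):
    """Smooth L1 loss on a scalar: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """L = (1/N_cls) * sum_i L_cls(p_i, p_i*)
         + lam * (1/N_reg) * sum_i p_i* * L_reg(t_i, t_i*)."""
    eps = 1e-12  # guards against log(0)
    cls_sum = 0.0
    reg_sum = 0.0
    for pi, pi_s, ti, ti_s in zip(p, p_star, t, t_star):
        # Binary log loss between predicted objectness and the 0/1 label.
        cls_sum += -(pi_s * math.log(pi + eps)
                     + (1 - pi_s) * math.log(1 - pi + eps))
        # The factor p_i* switches the regression term off for negatives.
        reg_sum += pi_s * sum(smooth_l1(a - b) for a, b in zip(ti, ti_s))
    return cls_sum / n_cls + lam * reg_sum / n_reg


# Two toy anchors: one positive, one negative.
p = [0.9, 0.2]
p_star = [1, 0]
t = [(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
t_star = [(0.0, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0)]
loss = rpn_loss(p, p_star, t, t_star, n_cls=2, n_reg=2)
```

The second anchor has a large box error, but because its label is 0 it contributes nothing to the regression term, only to the classification term.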

3. Experiments

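Since the main contribution of the paper is the RPN mechanism, this section looks only at the ablation experiments on the RPN. In my view, the RPN essentially places a fixed number of anchors on the feature map and uses reg to refine boxes of different sizes (the refinement is set up as a simple linear problem). The model and code structure are not complicated, it runs fast, and it consumes few computing resources.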