Paper Reading: YOLO

贾小树

Article link: https://blog.csdn.net/j879159541/article/details/100566705

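Table of Contents

1. Network Overview
2. YOLO Is Robust
3. Confidence and Class Confidence
4. Loss Function
5. How Is It Decided Whether a Grid Cell Contains an Object?
6. YOLO Compared with Other Networks
7. Ensembling YOLO with Fast R-CNN
8. Generalization Ability
9. How NMS Helps YOLO
10. Performance Comparison of the Networks
11. Error Analysis Tool
References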

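1. Network Overview

The network architecture is modeled on GoogLeNet, with fully connected layers added after the convolutional layers to make the predictions. YOLO was published after Faster R-CNN but before SSD. The dataset results in the paper are compared mainly against Fast R-CNN and DPM, possibly because Faster R-CNN had not yet been open-sourced at the time. YOLO is a one-stage detector: it is trained and tested end to end and runs in real time, but its accuracy is only moderate.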

YOLO uses only a single feature map. It does not generate bounding boxes (BBs) with a sliding window or region proposals; instead, each cell of the feature map predicts B bounding boxes.

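Note: unlike anchors, YOLO's boxes have no fixed size or aspect ratio. The box center is an offset relative to its grid cell, and the width and height are fractions of the whole image, so all four values lie between 0 and 1. The initial box sizes are effectively random and are regressed toward the ground truth (GT) during training. Perhaps for this reason, YOLO detects objects using global information from the whole image, which gives it few false positives on background but also a large localization error. The paper states: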

Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image.

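YOLO also generalizes poorly across object aspect ratios: it struggles to localize objects with unusual proportions, because it learns box widths and heights from the training data.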
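The box parameterization described above (center as an offset inside a grid cell, width and height as fractions of the image) can be made concrete with a small decoding sketch. The function name `decode_box` and its arguments are hypothetical, chosen only for illustration; the 448×448 image size and S = 7 follow the paper's setup.

```python
def decode_box(x, y, w, h, row, col, S=7, img_w=448, img_h=448):
    """Convert a YOLO-style box to pixel corners (x1, y1, x2, y2).

    x, y: center offset inside grid cell (row, col), each in [0, 1].
    w, h: width and height as fractions of the whole image, each in [0, 1].
    """
    cx = (col + x) / S * img_w   # center x in pixels
    cy = (row + y) / S * img_h   # center y in pixels
    bw = w * img_w               # box width in pixels
    bh = h * img_h               # box height in pixels
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# A box centered in the middle cell, a quarter of the image wide and half as tall.
corners = decode_box(0.5, 0.5, 0.25, 0.5, row=3, col=3)
print(corners)  # (168.0, 112.0, 280.0, 336.0)
```

Because w and h are not tied to any predefined anchor shape, nothing in this decoding constrains the aspect ratio; whatever shapes the network outputs come from what it learned during training.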

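In the paper B is set to 2, the input is 448×448, and the feature map is 7×7. Each BB predicts five numbers: x, y, w, h, and a confidence. The C class probabilities, however, are predicted once per grid cell rather than once per BB, so each grid cell can predict only one class. This is why YOLO performs poorly on small objects that appear in groups, such as a flock of birds.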
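To make the bookkeeping concrete, here is a minimal NumPy sketch of how a 7×7 output with B = 2 and C = 20 (the PASCAL VOC setting) splits into per-box and per-cell parts. The channel ordering (boxes first, then class probabilities) is an assumption for illustration, and the random array only stands in for a real network output.

```python
import numpy as np

S, B, C = 7, 2, 20      # grid size, boxes per cell, classes (PASCAL VOC)
depth = B * 5 + C       # 5 numbers per box plus one class vector per cell

# Stand-in for a network output; random values, for shape illustration only.
rng = np.random.default_rng(0)
pred = rng.random((S, S, depth))

# Assumed layout: the first B*5 channels are boxes, the remaining C are classes.
boxes = pred[..., : B * 5].reshape(S, S, B, 5)   # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5 :]                 # one set per cell, shared by its B boxes

print(pred.shape, boxes.shape, class_probs.shape)
# (7, 7, 30) (7, 7, 2, 5) (7, 7, 20)
```

The shapes show the limitation directly: there are two boxes per cell but only one class vector, so both boxes in a cell share the same class distribution.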

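Predicting a confidence first and the class afterward is somewhat reminiscent of Faster R-CNN, where the RPN first finds proposals that are likely to contain an object without predicting a class, and the class is predicted only later by the Fast R-CNN stage.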

2. YOLO Is Robust

Compared with DPM and R-CNN, YOLO still detects well when the test dataset is swapped for a different one. The paper states:

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected input.

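3. Confidence and Class Confidence

The confidence has two parts: (1) whether the box contains an object, Pr(Object), and (2) how accurate the box is, measured by its IOU with the ground truth. Their product is the confidence.

The class confidence has three parts: Pr(Class_i | Object) × Pr(Object) × IOU. The conditional class probabilities Pr(Class_i | Object) exist only per grid cell, whereas the confidence exists for every BB; multiplying them at test time gives a class-specific score for each box.

The final prediction is an S × S × (B × 5 + C) tensor, which for S = 7, B = 2, and C = 20 (PASCAL VOC) is 7 × 7 × 30.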
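The relation between the per-box confidence and the per-cell class probabilities can be sketched with NumPy broadcasting. The arrays below are random stand-ins rather than real network outputs, and the variable names are illustrative.

```python
import numpy as np

S, B, C = 7, 2, 20
rng = np.random.default_rng(0)
box_conf = rng.random((S, S, B))      # stand-in for Pr(Object) * IOU, one per box
class_probs = rng.random((S, S, C))   # stand-in for Pr(Class_i | Object), one set per cell

# Broadcast (S, S, B, 1) * (S, S, 1, C) -> (S, S, B, C):
# every box in a cell is scored against that cell's class probabilities.
scores = box_conf[..., :, None] * class_probs[..., None, :]

print(scores.shape)                 # (7, 7, 2, 20)
print(scores.reshape(-1, C).shape)  # (98, 20): 98 boxes, 20 class scores each
```

Flattening gives 7 × 7 × 2 = 98 candidate boxes per image, each with 20 class-specific scores, which is what later thresholding and non-maximal suppression operate on.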

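4. Loss Function

The loss is a sum-squared error. To balance the different parts of the loss, each part is weighted: the paper uses λ_coord = 5 for the box coordinate terms and λ_noobj = 0.5 for the confidence of boxes that contain no object.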
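For reference, the multi-part loss from the YOLO paper can be written as follows. The notation follows the paper: the indicator 1_ij^obj is 1 when box j in cell i is responsible for an object, 1_ij^noobj is 1 when it is not, 1_i^obj is 1 when an object appears in cell i, C_i is the confidence, and p_i(c) is the conditional class probability. Hatted symbols are predictions.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the localization terms, the next two are the confidence terms for boxes with and without an object, and the last line is the classification term, which is counted only for cells that contain an object.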

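To balance how the same deviation affects small and large boxes, the loss is computed on the square roots of w and h rather than on w and h directly.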
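A small numeric sketch shows what the square root does. The helper `wh_term` and the example sizes are made up for illustration: the same absolute width error is applied to a small box and a large box, and the penalty is computed with and without the square root.

```python
import math


def wh_term(pred, true, use_sqrt=True):
    """Squared error on one width/height value, with or without the square root."""
    if use_sqrt:
        return (math.sqrt(pred) - math.sqrt(true)) ** 2
    return (pred - true) ** 2


# Same absolute error (0.05 of the image width) on a small and a large box.
small_true, large_true, err = 0.10, 0.80, 0.05

small_raw = wh_term(small_true + err, small_true, use_sqrt=False)
large_raw = wh_term(large_true + err, large_true, use_sqrt=False)
small_sqrt = wh_term(small_true + err, small_true)
large_sqrt = wh_term(large_true + err, large_true)

print(small_raw, large_raw)    # both roughly 0.0025: plain squared error treats them alike
print(small_sqrt, large_sqrt)  # the small box is penalized several times more
```

Without the square root, both boxes receive the same penalty. With it, the error on the small box costs noticeably more, which matches the intent that a given deviation matters more for small boxes.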

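5. How Is It Decided Whether a Grid Cell Contains an Object?

First, what a grid cell means. Take the 7×7 grid from the paper: this size comes from repeatedly extracting features from the input image (assume 224×224) and downsampling it by a factor of 32. Equivalently, the input image is divided into 7×7 grid cells, so each grid cell corresponds to a 32×32-pixel region of the input. (With the paper's 448×448 input, each cell covers 64×64 pixels.)

Back to the question. The annotations give the center coordinates of every object in the input image, which tells us which grid cell each center falls in. The grid cell that an object's center falls in is the one responsible for predicting that object; that is, this grid cell contains the object. Each grid cell predicts two bounding boxes, but only one of them is used to predict the object that belongs to the cell. Which one? The bounding box with the highest IOU against the object's ground truth.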
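The assignment rule above can be sketched in a few lines of Python. The function names, the 224×224 image size, and the example boxes are assumptions for illustration; boxes are given as pixel corners (x1, y1, x2, y2).

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(gt_box, img_size=224, S=7):
    """Return (row, col) of the grid cell that contains the ground-truth center."""
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    cell = img_size / S                      # 32 pixels per cell for 224 / 7
    col = min(int(cx // cell), S - 1)        # clamp centers on the right/bottom edge
    row = min(int(cy // cell), S - 1)
    return row, col


def responsible_box(gt_box, predicted_boxes):
    """Index of the cell's predicted box with the highest IOU against the ground truth."""
    return max(range(len(predicted_boxes)),
               key=lambda j: iou(gt_box, predicted_boxes[j]))


gt = (60, 70, 120, 150)                              # center at (90, 110)
preds = [(50, 60, 130, 160), (80, 100, 200, 220)]    # the cell's B = 2 predictions

print(responsible_cell(gt))        # (3, 2)
print(responsible_box(gt, preds))  # 0
```

Here the center (90, 110) falls in column 2 and row 3 of the 32-pixel grid, and the first predicted box (IOU 0.6) is chosen over the second (IOU about 0.12).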

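6. YOLO Compared with Other Networks

MultiBox still needs a further classification step, whereas YOLO is already a complete object detection system. OverFeat is geared more toward localization than detection and does not use contextual information.
Note: I have not yet read these two papers, or SPPnet, and need to study them further.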

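7. Ensembling YOLO with Fast R-CNN

YOLO has a large localization error but makes few background errors, while Fast R-CNN has a small localization error but makes many background errors. The two models are ensembled to combine their strengths. The paper states: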

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes. The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%.
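The quoted passage does not give the exact rescoring formula, so the following is only a rough sketch of the idea. The additive boost `yolo_score * overlap`, the 0.5 IOU threshold, the same-class requirement, and the function name `rescore` are all assumptions made for illustration, not the paper's method.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def rescore(frcnn_dets, yolo_dets, iou_thresh=0.5):
    """Boost Fast R-CNN scores that are supported by a similar YOLO box.

    Each detection is (box, class_name, score). The boost formula and the
    threshold are assumptions for illustration only.
    """
    out = []
    for box, cls, score in frcnn_dets:
        boost = 0.0
        for ybox, ycls, yscore in yolo_dets:
            if ycls != cls:
                continue
            overlap = iou(box, ybox)
            if overlap >= iou_thresh:
                boost = max(boost, yscore * overlap)
        out.append((box, cls, score + boost))
    return out


frcnn = [((10, 10, 110, 110), "dog", 0.6),      # also found by YOLO
         ((200, 200, 260, 260), "dog", 0.6)]    # no YOLO support
yolo = [((10, 10, 110, 110), "dog", 0.8)]

rescored = rescore(frcnn, yolo)
print([round(s, 2) for _, _, s in rescored])    # [1.4, 0.6]
```

Detections that YOLO agrees with move up in the ranking, while unsupported ones, which are more likely to be background mistakes, keep their original score and fall behind.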

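8. Generalization Ability

Academic work usually evaluates a model on (1) the training set and (2) the test set. Because both come from the same dataset, their data distributions are generally the same, so performance looks good on both. In real applications, though, the model meets many scenes it has never seen, and even for the same kind of object, performance can drop because the data distribution differs.

Section 4.5 of the paper discusses generalization and compares three models: R-CNN, DPM, and YOLO. The paper states: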

YOLO outperforms other detection methods across the board. It has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork.
R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. This suggests that R-CNN is highly overfit to PASCAL VOC. R-CNN uses Selective Search for bounding box proposals which is highly tuned for natural images. The classifier step in R-CNN only sees small regions and is dependant on getting good proposals from Selective Search.
DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it also starts from a lower AP.
YOLO has high performance on VOC 2007 and it generalizes well. Like DPM, YOLO models the size and shape of objects. Since it looks at the whole image, it also models relationships between objects and where objects commonly appear in scenes. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.

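R-CNN is overfit and generalizes poorly, because it relies on Selective Search (SS).
DPM generalizes relatively well.
YOLO generalizes best, because it reasons from the features of the whole input image and also models the shape of objects and the relationships between objects.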

9. How NMS Helps YOLO

However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
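For readers unfamiliar with it, a minimal greedy non-maximal suppression can be sketched as follows. This is a generic textbook version, not code from the paper; the 0.5 IOU threshold and the example detections are assumptions, and in practice it would be run separately for each class.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thresh=0.5):
    """Greedy NMS. dets is a list of (box, score); returns the kept detections."""
    remaining = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-scoring box still in play
        kept.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_thresh]
    return kept


# Two neighboring cells localize the same object; a third box is elsewhere.
dets = [((10, 10, 110, 110), 0.9),
        ((15, 15, 115, 115), 0.8),
        ((200, 200, 260, 260), 0.7)]

kept = nms(dets)
print([score for _, score in kept])   # [0.9, 0.7]
```

The second box overlaps the first with an IOU of about 0.82 and is suppressed, which is exactly the duplicate-detection case described in the quote.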