YOLO v1: A Close Reading of the Paper, with Questions and Answers

迷你可可小生

Published 2024-12-15 15:27:58

Tags: YOLO, deep learning, artificial intelligence

Article link: https://blog.csdn.net/dnnjhh/article/details/144298852

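Preface

YOLO is an object detection algorithm proposed in 2015. The core idea of the YOLO family is to turn object detection into a regression problem: the whole image is the network input, and a single neural network outputs the bounding box positions and the classes they belong to. This post walks through each section of the paper in detail and raises some questions along the way. Since many people have already written up this paper, the most useful part here is probably the questions that occurred to me while reading, together with my attempts at answering them. They may be helpful to you.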

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

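In short: YOLO is a new approach to object detection. Earlier work repurposed classifiers for detection (a two-step process), whereas this paper treats box prediction and classification together as a single regression problem. Because the architecture is unified, detection is very fast: the base model runs at 45 frames per second and Fast YOLO at 155, while Fast YOLO still achieves twice the mAP of other real-time detectors. Compared with state-of-the-art systems, YOLO makes more localization errors but fewer false positives on background. Finally, it generalizes well, for example from natural images to artwork.

mAP: mean Average Precision is a key metric for evaluating object detection algorithms. In general, mAP is obtained by averaging the Average Precision (AP) over all classes.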
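To make the metric concrete, here is a minimal sketch of AP and mAP. It assumes the all-point interpolation of the precision-recall curve (VOC 2007 itself used an 11-point variant), and the function names `average_precision` and `mean_average_precision` are my own, not from the paper.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one class: area under the interpolated precision-recall curve.

    scores: confidence of each detection
    is_true_positive: whether each detection matched a ground-truth box
    num_ground_truth: number of ground-truth objects of this class
    """
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Interpolate: precision at a recall level is the best precision
    # achieved at that recall or any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the area of the rectangles under the curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: plain mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, three detections ranked (hit, miss, hit) against two ground-truth objects give an AP of 5/6.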

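Two-stage: first generate regions, called region proposals (RP, candidate boxes that may contain an object), then classify them with a convolutional neural network. Pipeline: feature extraction -> region proposal generation -> classification / box regression. Common two-stage detectors include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, and R-FCN.

One-stage: no region proposals; the network extracts features and predicts object classes and locations directly. Pipeline: feature extraction -> classification / box regression. Common one-stage detectors include OverFeat, YOLOv1, YOLOv2, YOLOv3, SSD, and RetinaNet.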

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

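Human vision takes in a scene at a glance, and we can do things like driving with little conscious thought. The paper then notes that fast, accurate detection algorithms would let computers drive cars without specialized sensors and let assistive devices convey real-time scene information to people.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image.

This describes the approach of the time: detection is built by repurposing a classifier, which is a lot of work. The paper mentions that the earlier DPM uses a sliding window (a window is slid across the image at evenly spaced locations and several scales, and the classifier is run at each position to find and classify candidate regions).

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

These methods work with region proposals: Selective Search (which groups regions by similarity in color, texture, fill, and size), followed by NMS (which keeps the best of several overlapping candidate boxes). The paper then points out that this pipeline is slow and hard to optimize, because each component has to be trained separately.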

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

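In other words, YOLO turns two steps into one: object detection is recast as a single regression problem, and with this system you only need to look at the image once.

There is a single network between input and output, trained end to end: just one convolutional network. YOLO trains on full images, whereas earlier methods first found candidate regions and then handed them to a CNN for classification. YOLO takes a global view of the image.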

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Advantage 1: speed. At test time you simply run the trained network on a new image and the detections come out; no complex pipeline is needed. The paper adds that streaming video can be processed in real time with less than 25 ms of latency, that YOLO achieves more than twice the mAP of other real-time systems, and that a demo is available on the project webpage.

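Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Advantage 2: global reasoning. Unlike sliding-window and region-proposal methods, YOLO sees the entire image, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN was a top detection method at the time, but it mistakes background patches for objects because it cannot see the larger context: it sees the leaves but not the forest.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

Advantage 3: generalization. Trained on natural images and tested on artwork, YOLO outperforms methods like DPM and R-CNN by a wide margin. Because it generalizes well, it is less likely to break down on unexpected inputs.

Drawbacks: YOLO lags behind state-of-the-art systems in accuracy and struggles with small objects, for example flocks of birds. Selective Search proposes about 2000 boxes per image, while YOLO predicts only 98, which tend to be fairly large.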

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Classification and localization are integrated into one network. The network uses features from the whole image to predict each bounding box, and it predicts all boxes for all classes simultaneously, so it reasons globally about the full image and every object in it. The design allows end-to-end training and real-time speed while keeping average precision high.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

The image is divided into an S × S grid; in practice S = 7, giving 7 rows and 7 columns, 49 cells. If the center of an object falls into a cell, that cell is responsible for detecting the object. Each cell predicts B bounding boxes, each with its own confidence score. The confidence answers two questions: does the box contain an object, and how accurate is the box? If no object is present, Pr(Object) is 0 and the confidence should be 0 (the box is background). If an object is present, Pr(Object) is 1 and the confidence should equal the IOU between the predicted box and the ground-truth (GT) box.
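The confidence target described above can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`; the helper names `iou` and `confidence_target` are hypothetical and not taken from the paper or from Darknet.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def confidence_target(has_object, predicted_box, truth_box):
    """Pr(Object) * IOU: 0 for an empty cell, otherwise the IOU itself."""
    return iou(predicted_box, truth_box) if has_object else 0.0
```

Two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, so their IOU is 1 / (4 + 4 - 1) = 1/7.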

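Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

(x, y) is the center of the box, w is its width and h its height, and the confidence represents the IOU between the predicted box and the ground-truth box. x and y are expressed relative to the bounds of the grid cell, and w and h relative to the whole image, so all four values lie between 0 and 1.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

Each cell predicts C class probabilities. On PASCAL VOC there are 20 classes, so each cell outputs 20 conditional class probabilities, and these are shared by the cell's B = 2 boxes (A and B).

Finally the paper gives the formula for the class-specific confidence score of each box.

The score reflects two things at once: the probability that a particular class appears in the box and how well the predicted box fits the object. This is where the regression view shows up.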
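For reference, the class-specific confidence score that the paper computes at test time multiplies the conditional class probability by the box confidence. Written out in LaTeX, using the paper's notation:

```latex
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}
  = \Pr(\text{Class}_i) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}
```

The left-hand factors are what the network outputs (20 class probabilities per cell and one confidence per box); the right-hand side is the per-box, per-class score that is later thresholded and passed to NMS.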

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

The model is a CNN with convolutional layers (feature extraction) followed by fully connected layers (predicting the output probabilities and coordinates). It is inspired by GoogLeNet and has 24 convolutional layers and 2 fully connected layers, but instead of Inception modules it simply uses 1×1 reduction layers followed by 3×3 convolutional layers.

Why is the output 7×7×30? Because YOLO divides the image into 7 rows and 7 columns, and each cell carries 30 values: B × 5 + C = 2 × 5 + 20.
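The arithmetic behind the 7×7×30 tensor can be checked directly. The sketch below also splits one cell's 30 values into boxes and class probabilities; the ordering used here (boxes first, then classes) is an assumption for illustration only, since the actual memory layout in Darknet may differ.

```python
def output_shape(S=7, B=2, C=20):
    """Shape of the YOLO v1 prediction tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)


def split_cell(cell_vector, B=2, C=20):
    """Split one cell's vector into B boxes of (x, y, w, h, confidence)
    and C class probabilities. The ordering is assumed, not from the paper."""
    boxes = [cell_vector[i * 5:(i + 1) * 5] for i in range(B)]
    class_probs = cell_vector[B * 5:B * 5 + C]
    return boxes, class_probs


shape = output_shape()                 # (7, 7, 30)
total_boxes = shape[0] * shape[1] * 2  # 98 boxes per image
```

This is also where the figure of 98 boxes per image comes from: 7 × 7 cells × 2 boxes.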

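1×1 convolution: a single 1×1 filter turns a W×H×D input into W×H×1; using K such filters gives W×H×K. Choosing K smaller than D reduces the channel dimension before the more expensive 3×3 convolution.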
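A 1×1 convolution is just a weighted sum over channels applied independently at every pixel, which the following pure-Python sketch makes explicit. Bias and activation are omitted, and `conv1x1` is a hypothetical helper written for this post.

```python
def conv1x1(feature_map, filters):
    """Apply K 1x1 filters to an H x W x D feature map.

    feature_map: nested lists of shape H x W x D
    filters: K weight vectors, each of length D
    returns: nested lists of shape H x W x K
    """
    return [
        [
            [sum(w * v for w, v in zip(f, pixel)) for f in filters]
            for pixel in row
        ]
        for row in feature_map
    ]


# A 2 x 2 map with D = 3 channels, reduced to K = 1 channel.
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [1, 0, 1]]]
reduced = conv1x1(fmap, [[1, 0, -1]])
```

The spatial size stays 2×2; only the depth changes from 3 to the number of filters.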

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

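The paper also describes a fast version for quicker detection. It is faster because it uses fewer convolutional layers (9 instead of 24) and fewer filters; everything else is unchanged.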

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].

The convolutional layers are pretrained on ImageNet, using the first 20 convolutional layers followed by an average-pooling layer and a fully connected layer. Training takes about a week and reaches 88% single-crop top-5 accuracy on the ImageNet 2012 validation set.

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

The model is then converted for detection. Following Ren et al., four convolutional layers and two fully connected layers are added, with randomly initialized weights. Detection often needs fine-grained information (by analogy, coarse-grained is "car", fine-grained is "BMW" or "Audi"), so the input resolution is increased from 224×224 to 448×448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

This is the normalization described above.
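As a worked example of this parametrization, the sketch below converts a ground-truth box given in pixels into the cell index plus the four normalized targets. The function name `encode_box` and its argument order are my own choices for illustration.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a box (center cx, cy and size w, h, all in pixels).

    Returns (row, col, x, y, w_norm, h_norm) where
    - (row, col) is the grid cell containing the box center,
    - x, y are the center offsets inside that cell, in [0, 1),
    - w_norm, h_norm are the size divided by the image size.
    """
    cell_w = img_w / S
    cell_h = img_h / S
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / cell_w), S - 1)
    row = min(int(cy / cell_h), S - 1)
    x = cx / cell_w - col
    y = cy / cell_h - row
    return row, col, x, y, w / img_w, h / img_h


# In a 448 x 448 image each cell is 64 x 64 pixels.
target = encode_box(cx=224, cy=100, w=112, h=224, img_w=448, img_h=448)
```

Here the center (224, 100) lands in row 1, column 3, with offsets 0.5 and 0.5625 inside that cell, and the normalized size is (0.25, 0.5).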

Activation function

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = 0.5.

The model is optimized with sum-squared error. Why? Because it is easy to optimize. The drawback is that it weights localization error equally with classification error, which is not ideal. There is a second imbalance: most grid cells in an image contain no object, and pushing their confidence scores toward zero can overpower the gradient from the cells that do contain objects.

The fix: increase the loss from bounding box coordinate predictions (λcoord = 5) and decrease the loss from confidence predictions for boxes that contain no object (λnoobj = 0.5).

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

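Sum-squared error also treats large and small boxes alike, but a small deviation in a large box matters less than the same deviation in a small box. Predicting the square root of the width and height partially compensates for this. An open question for me: why are the two cases, B and C, compared?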

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

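YOLO predicts B boxes per cell, but during training only one box predictor should be responsible for each object: the one whose prediction has the highest current IOU with the ground truth. This leads to specialization, with each predictor getting better at certain sizes, aspect ratios, or classes, which improves overall recall.

Loss function

Looking at the loss formula: why does the class loss use MSE rather than cross-entropy?

How is the ground-truth value of the confidence computed?

Why is there no λcoord in front of the confidence term for boxes that contain an object?

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

The bounding box coordinate error is penalized only when that predictor is responsible for a ground-truth box, that is, the one kept by the IOU comparison; the discarded predictor is ignored. In the loss function this means that of the two boxes A and B only one contributes a coordinate error (the blue region).

The classification error is penalized only when an object is present in the grid cell (the green region).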
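For reference, the multi-part loss from the paper can be written out in LaTeX as follows. Here $\mathbb{1}_{ij}^{\text{obj}}$ indicates that the $j$-th predictor in cell $i$ is responsible for an object, $\mathbb{1}_{ij}^{\text{noobj}}$ is its complement, and $\mathbb{1}_{i}^{\text{obj}}$ indicates that an object appears in cell $i$; hatted symbols are one side of each prediction/target pair.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the coordinate terms weighted by λcoord, the third and fourth are the confidence terms for responsible and non-responsible predictors (the former with an implicit weight of 1), and the last line is the per-cell classification term.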

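Hyperparameters

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005. Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.

The network is trained for about 135 epochs on the PASCAL VOC 2007 and 2012 training and validation sets, with a batch size of 64. Momentum is 0.9: it adds inertia so the optimizer does not stop at a small uphill stretch, which helps avoid getting stuck in a poor local minimum. Weight decay (0.0005) guards against overfitting.

Learning rate schedule: over the first epochs the learning rate is raised slowly from 0.001 to 0.01, then held at 0.01 for 75 epochs, lowered to 0.001 for 30 epochs, and finally to 0.0001 for 30 epochs. (Update rule: new value = old value - learning rate × gradient.)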
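The schedule can be sketched as a simple function of the epoch number. The paper does not say how many epochs the warm-up lasts or what shape it takes, so `warmup_epochs=5`, the linear ramp, and counting the warm-up separately from the 75 + 30 + 30 epochs are all assumptions made here for illustration.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a given (0-based) epoch.

    warmup_epochs and the linear ramp are assumptions; the paper only says
    the rate is raised slowly from 1e-3 to 1e-2 "for the first epochs".
    """
    if epoch < warmup_epochs:
        # Linear ramp from 1e-3 up to 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:
        return 1e-3
    return 1e-4


def sgd_step(weight, gradient, lr):
    """Plain gradient step: new value = old value - learning rate * gradient."""
    return weight - lr * gradient
```

Starting low and ramping up addresses the instability mentioned in the paper: with a high initial rate the model often diverges.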

Dropout

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

The core idea is to inject noise, breaking up insignificant accidental patterns so the model does not memorize them.

Data augmentation

Highly practical: photometric transforms and geometric transforms, for example adjusting contrast, hue, and saturation.
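The photometric part of the augmentation (exposure and saturation in HSV space, by up to a factor of 1.5) can be illustrated on a single pixel with the standard-library `colorsys` module. How the random factor is drawn is not specified in the paper; sampling either `f` or `1/f` with equal probability is an assumption, and both function names are hypothetical.

```python
import colorsys
import random


def random_factor(max_factor=1.5, rng=random):
    """Draw a scale factor in [1/max_factor, max_factor] (assumed scheme)."""
    f = rng.uniform(1.0, max_factor)
    return f if rng.random() < 0.5 else 1.0 / f


def jitter_pixel(rgb, sat_factor, exp_factor):
    """Scale saturation and exposure (HSV value) of one RGB pixel.

    rgb: three floats in [0, 1]; results are clipped to stay in [0, 1].
    """
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(1.0, s * sat_factor)
    v = min(1.0, v * exp_factor)
    return colorsys.hsv_to_rgb(h, s, v)


brighter = jitter_pixel((0.2, 0.4, 0.6), sat_factor=1.0, exp_factor=1.5)
```

In a real pipeline the same two factors would be applied to every pixel of an image, and the geometric transforms (scaling and translation of up to 20%) would also have to be applied to the ground-truth boxes.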

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

Inference: as in training, a test image needs only one network evaluation, which yields 98 bounding boxes and class probabilities for each box. The grid enforces spatial diversity: usually it is clear which cell an object falls into, and the network predicts only one box per object. But large objects, or objects near the border of several cells, may be localized by multiple cells, producing duplicate detections, so NMS (winner takes all) is used to suppress them. This adds 2-3% mAP.

2.4. Limitations of YOLO

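YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

YOLO imposes strong spatial constraints on bounding boxes: each cell predicts only two boxes and can have only one class. This limits how many nearby objects the model can predict, so it does poorly on small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

The model learns to predict boxes from data, so it struggles to generalize to objects with new or unusual aspect ratios or configurations: it does well on what it has seen and poorly on what it has not. The architecture also contains several downsampling layers (which reduce computation, but small objects can get lost along the way), so the features used for box prediction are relatively coarse.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

The loss function treats errors in large and small boxes the same. But a small error in a large box is generally benign, while a small error in a small box has a much greater effect on IOU (for the same 10-pixel offset, a small object is far more sensitive).

The main source of error is incorrect localization. Other errors are rarer; in particular, YOLO seldom mistakes background for a specific class, because it sees the whole image, whereas other methods lack that context.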

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Detection pipelines first extract features, and the methods have evolved: before CNNs, hand-crafted low-level features (Haar [25], SIFT [23], HOG [4]); in the CNN era, convolutional features. Then comes classification in feature space: early systems used brute-force sliding windows, later ones Selective Search (whose drawback is that the classifier sees only a small patch, not the whole image). This was the standard machine learning recipe. YOLO is compared with top detection systems: R-CNN, DPM, and OverFeat.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

DPM (deformable parts model: for people in many poses, a single standing template does not match well, so part-level models are used) takes a sliding-window approach with a disjoint pipeline: extract static features, classify regions, predict bounding boxes for high-scoring regions. YOLO replaces all of this with a single network, turning several steps into one. The network performs feature extraction, box prediction, non-maximal suppression, and contextual reasoning concurrently.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

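Note on Fast and Faster R-CNN: both speed up the R-CNN framework by sharing computation and by replacing Selective Search with a neural network for proposing regions [14] [28]. They are faster and more accurate than R-CNN, but neither reaches real-time performance.

YOLO resembles R-CNN in that each grid cell proposes candidate boxes and scores them with convolutional features. The differences: the spatial constraints on grid cell proposals reduce duplicate detections of the same object; YOLO proposes only 98 boxes per image versus about 2000 from Selective Search; and the individual components are merged into a single, jointly optimized model.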

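Many efforts focus on speeding up the DPM pipeline [31] [38] [5]: faster HOG computation, cascades, and pushing computation to GPUs. Yet only 30Hz DPM [31] actually runs in real time. These efforts stayed within the traditional pipeline and leaned on hardware rather than rethinking the approach for a new era, whereas YOLO is ahead by design. On artwork, DPM's results may hold up about as well as on real photos.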

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox: it can perform detection, but only single-object detection; it cannot perform general object detection and is just one piece of a larger detection pipeline. YOLO is a complete system.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat: Sermanet et al. also train a network to perform localization. OverFeat performs sliding-window detection efficiently, but it is still a disjoint system, not one-stage. It optimizes for localization rather than detection performance, sees only local information, cannot reason about global context, and so needs significant post-processing.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp: YOLO builds on the authors' own earlier work. MultiGrasp only needs to predict a single graspable region for an image containing one object. It does not have to estimate the object's size, location, or boundaries, or predict its class; it only has to find a region suitable for grasping. YOLO, by contrast, predicts bounding boxes and class probabilities for multiple objects of multiple classes in an image.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn’t have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

YOLO is first compared with other real-time detection systems. To understand the differences between YOLO and the R-CNN variants, the errors made on VOC 2007 by YOLO and by Fast R-CNN (one of the highest-performing versions of R-CNN) are examined.

Based on the different error profiles, YOLO can be used to rescore Fast R-CNN detections and reduce background false positives, giving a significant performance boost.

VOC 2012 results are also presented, comparing mAP with the state-of-the-art methods of the time.

Finally, YOLO's generalization is demonstrated on two artwork datasets.

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

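Many researchers work on making detection faster, but only Sadeghi et al. actually produce a detection system that runs in real time (30 fps or better). YOLO is compared with their GPU implementation of DPM, which runs at either 30Hz or 100Hz. For the other systems, which do not reach real time, relative mAP and speed are compared as well.

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28] However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Among real-time detectors, none is faster than Fast YOLO and none is more accurate than YOLO. Fast YOLO is the fastest detector on record for PASCAL VOC detection, and it is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still running well above real-time speed. Non-real-time systems are compared too.

Fast YOLO is the fastest object detection method on PASCAL. At 52.7% mAP it is more than twice as accurate as prior real-time detection work. YOLO raises mAP to 63.4% while still maintaining real-time performance.

YOLO was also trained with VGG-16. That model is more accurate but also much slower than YOLO.

Fastest DPM effectively speeds up DPM without sacrificing much mAP, but it still misses real-time performance by a factor of 2. Seen in historical context, the approach was outdated and has been superseded.

R-CNN Minus R replaces Selective Search with static bounding box proposals [20]. With fixed boxes it is faster than R-CNN, but it still falls short of real time and loses considerable accuracy.

Fast R-CNN has a high mAP, but it still relies on Selective Search, which takes around 2 seconds per image to generate bounding box proposals.

Faster R-CNN replaces Selective Search with a neural network to propose bounding boxes. In the paper's tests, its most accurate model reaches 7 fps, while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher than YOLO but also 6 times slower.

The ZeilerFergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.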

4.2. VOC 2007 Error Analysis

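To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

YOLO is compared with Fast R-CNN because Fast R-CNN is one of the highest-performing detectors on PASCAL and its detections are publicly available.

For each category at test time, the top N predictions for that category are examined. Each prediction is either correct or is classified by its type of error (localization, similar, other, or background).

In YOLO, localization errors account for more of the errors than all other sources combined. Fast R-CNN makes far fewer localization errors but far more background errors. Fast R-CNN is almost 3 times more likely than YOLO to predict background detections.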
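The error categories can be sketched as a small classifier over (IOU, class match). The thresholds below are the ones I recall from the paper's error analysis (correct: right class and IOU > 0.5; localization: right class and 0.1 < IOU < 0.5; similar or other: wrong class and IOU > 0.1; background: IOU < 0.1). The function name and the handling of values exactly on a threshold are my own assumptions.

```python
def error_type(iou, class_correct, class_similar=False):
    """Classify one detection using the VOC 2007 error analysis categories.

    iou: best IOU between the detection and any ground-truth object
    class_correct: predicted class equals the ground-truth class
    class_similar: predicted class is merely similar to the true class
    Values exactly at 0.1 or 0.5 are not covered by the paper's strict
    inequalities; how they fall here is an arbitrary choice.
    """
    if iou < 0.1:
        return "background"
    if class_correct:
        return "correct" if iou > 0.5 else "localization"
    return "similar" if class_similar else "other"
```

With this breakdown, YOLO's dominant category is "localization", while Fast R-CNN's detections fall into "background" far more often.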

4.3. Combining Fast R-CNN and YOLO

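By using YOLO to eliminate background detections from Fast R-CNN, performance improves significantly. Fast R-CNN works from about 2000 candidate boxes and YOLO predicts 98. When a Fast R-CNN box overlaps strongly with a similar YOLO box, that prediction gets a larger boost; when no YOLO box is similar, it is weighted lower.

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.

The boost from YOLO is not simply a by-product of model ensembling; the two models have complementary strengths.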

In other words, YOLO can be used to improve Fast R-CNN, but Fast R-CNN cannot be used to improve YOLO.

4.4. VOC 2012 Results

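On the VOC 2012 test set, per-class results are shown. YOLO scores 57.9% mAP, below the state of the art at the time and closer to the original R-CNN using VGG-16. On categories such as bottle, sheep, and tv/monitor, YOLO scores 8-10% lower than R-CNN or Feature Edit. On other categories such as cat and train, YOLO does better.

The system struggles with small objects. The combined Fast R-CNN + YOLO model is one of the highest-performing detection methods.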

4.5. Generalizability: Person Detection in Artwork

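Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

In academic datasets, the training and test sets come from the same distribution; in practice they can differ, for example training on photographs of people and testing on people in artwork. YOLO is compared with other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection in artwork.

R-CNN has a high AP on VOC 2007, but it drops off considerably when applied to artwork.

It uses Selective Search, which is tuned for natural images, so its ability does not transfer to artwork.

DPM maintains its AP well when applied to artwork, since it models the spatial layout of objects; its AP was low to begin with, so it has less to lose.

5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

YOLO is a fast, accurate object detector, which makes it well suited to computer vision applications. Connected to a webcam, it maintains real-time performance, including the time to fetch images from the camera and display the detections.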

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.

Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

1. YOLO is one-stage.

2. It trains on full images.

3. It is trained on a loss function that corresponds directly to detection performance, end to end.

4. Fast YOLO is the fastest general-purpose object detector in the literature.

5. YOLO sets the benchmark for real-time detection.

6. YOLO is a fast, robust, general-purpose detection model.

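7. Extending Knowledge

1. Selective Search

Algorithm: segment the image into small initial regions, compute the similarity between neighboring regions, merge the most similar pair, then recompute the similarities between the merged region and its neighbors, and repeat.

Is this similar to a hypergraph? The main goal is to establish a hierarchy of regions.

Merging: the more similar two regions are, the sooner they are merged. Diversification strategies are used: color similarity, texture similarity, and varying thresholds.

2. NMS

NMS mainly solves the problem of one object being detected multiple times: among many overlapping boxes in a region, it picks the best one.

(1) Take the 98 boxes described above and look at one class at a time, that is, the scores of all 98 boxes for that class. Pick the box with the highest score and compare every remaining box with it. If the IoU of the two exceeds a threshold, they are considered duplicate detections of the same object, and the score of the lower-scoring one is reset to 0.

(2) After the top box has been compared with all the others, pick the highest-scoring box among those remaining and compare it with the rest, and so on; repeat for every class. Note that you cannot simply keep only the single highest-scoring box, because the image may contain several objects of that class; boxes whose IoU with the chosen box is below the threshold are therefore kept.

(3) The result is a sparse matrix, since many entries have been reset to 0. Reading out the non-zero entries with their scores and classes gives the final detections.

Note: NMS is applied only at prediction time, not during training. In training, every box contributes to the loss function whether or not it ends up predicting an object, so its score cannot simply be reset to 0.

Original source: https://blog.csdn.net/weixin_43334693/article/details/129011644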
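The three NMS steps above can be sketched for a single class as follows. This is my own minimal illustration, not code from the paper or from the linked post: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the default `iou_threshold=0.5` is an assumed value.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_one_class(boxes, scores, iou_threshold=0.5):
    """Return a copy of scores with suppressed boxes reset to 0.

    Works on the scores of a single class; call once per class.
    """
    scores = list(scores)
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    for pos, i in enumerate(order):
        if scores[i] == 0:
            continue  # already suppressed by a stronger box
        for j in order[pos + 1:]:
            # A weaker box that overlaps too much is a duplicate detection.
            if scores[j] > 0 and iou(boxes[i], boxes[j]) > iou_threshold:
                scores[j] = 0.0
    return scores


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms_one_class(boxes, [0.9, 0.8, 0.7])
```

The second box overlaps the first with an IoU of about 0.68 and is zeroed out, while the third box does not overlap and survives, which is why a second object of the same class is not lost.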

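8. Questions

8.1 In what way does YOLO treat object detection as a regression problem?

During training, with the ground truth known, the network fits a regression function from the image features to parameters such as box coordinates and confidence, written simply as y = f(x). Here f is the neural network, and what it learns are its weights. Given a large amount of data and the loss function defined above, training yields a function f that fits the data well, and that function is then used at test time.

8.2 How are images preprocessed for YOLO?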

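Image resizing: YOLO models constrain the input size, so images must be resized before being fed in. YOLO v1 takes a fixed 448×448 input; later versions in the family typically require the width and height to be multiples of 32, because the feature map is downsampled by a factor of 2 five times, and 2^5 = 32.
Image normalization: to make processing consistent across images, pixel values are usually normalized, commonly by dividing by 255 to scale them to the range 0 to 1.
Image augmentation: to improve robustness and generalization, images can be augmented, for example by adjusting brightness, enhancing contrast, or rotating. These operations help the model recognize objects across different scenes and lighting conditions.
Image padding: when the aspect ratio of the image differs from what the model expects, the image can be padded to the required aspect ratio, commonly by filling the borders with zero-valued pixels.
Channel adjustment: YOLO models usually expect 3-channel RGB input. If the input has a different number of channels, it must be converted to 3 channels.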
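The resizing and padding steps can be sketched as pure arithmetic. This is a generic "letterbox" calculation of the kind used with later YOLO versions; the paper itself does not describe it, and the function names and the default `target=448` are assumptions for illustration.

```python
def letterbox_params(img_w, img_h, target=448):
    """Scale an image to fit a target x target square, keeping aspect ratio.

    Returns (new_w, new_h, pad_left, pad_top): the resized dimensions and
    the padding needed on the left and top to center the image.
    """
    scale = min(target / img_w, target / img_h)
    new_w = round(img_w * scale)
    new_h = round(img_h * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def normalize_pixel(value):
    """Scale an 8-bit pixel value to the range 0 to 1."""
    return value / 255.0


params = letterbox_params(640, 480)
```

A 640×480 image is scaled by 0.7 to 448×336 and then padded with 56 rows at the top (and the remaining 56 at the bottom) to reach 448×448.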

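8.3 Why does each grid cell have B bounding boxes, and how are they obtained?

During training, the IOU between each predictor's bounding box and the ground truth is computed online, and the predictor with the larger IOU becomes responsible for predicting that object.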

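In YOLO v1 the two boxes per cell are not hand-picked priors with different aspect ratios: there are no anchor boxes (those were introduced in YOLOv2). Only the number of boxes, B = 2, is a hyperparameter. The box coordinates are regressed directly by the network; they start out essentially arbitrary, because the added layers are randomly initialized, and become more accurate as training proceeds and the loss decreases.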

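8.4 Why a 7×7 grid?

A 7×7 grid keeps computation efficient while still giving enough resolution to detect the objects in an image. With the 448×448 input, each cell covers 64×64 pixels.

8.5 What is the role of 1×1 convolution?

In YOLO v1, 1×1 convolutions with fewer channels alternate with 3×3 convolutions: each 1×1 layer reduces the feature space from the preceding layer before the next 3×3 convolution. This lowers the computation and also adds nonlinearity to the model.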

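8.6 Why does the loss function use the square root of the predicted box width and height?

Consider a large box and a small box whose predictions are each off from the ground truth by the same small amount. For the large box (large object) this hardly matters, while for the small box (small object) the same offset can leave the predicted box far from the object. Computing the sum-squared error directly on width and height, as in the first (coordinate) term, treats the two alike, which is clearly unreasonable. Taking the square root mitigates this to some extent.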
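A quick numeric check makes the effect visible. The widths used here (a 100-pixel box and a 10-pixel box, each over-predicted by 10 pixels) are made-up values for illustration.

```python
import math


def squared_error(pred, truth):
    """Plain squared error on the raw width (or height)."""
    return (pred - truth) ** 2


def sqrt_squared_error(pred, truth):
    """Squared error on the square roots, as in the YOLO v1 loss."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2


# Both boxes are predicted 10 pixels too wide.
large_plain = squared_error(110, 100)       # 100
small_plain = squared_error(20, 10)         # 100
large_sqrt = sqrt_squared_error(110, 100)   # about 0.24
small_sqrt = sqrt_squared_error(20, 10)     # about 1.72
```

On raw widths both errors are identical, whereas on square roots the small box is penalized roughly seven times more than the large one, which is the direction the paper wants. As the paper itself says, this only partially addresses the problem.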