YOLO Series Reading (1): Reading the YOLOv1 Paper, You Only Look Once: Unified, Real-Time Object Detection

CUHK-SZ-relu

Tags: deep learning, object detection

Article link: https://blog.csdn.net/qq_43210957/article/details/118754476

0.Abstract

0.1 Original text with notes

Paragraph 1 (how this work differs from prior work)

We present YOLO, a new approach to object detection.

Prior work on object detection repurposes classifiers to perform detection.
(That is, earlier detectors work in two steps: propose candidate regions, then classify them.)

Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
(Roughly: the network takes the whole image and regresses the detection result directly. The result consists of S×S×B boxes, each described by (x, y, w, h, c), plus a tensor of class probabilities. S×S×B is only the number of boxes; the five values per box and the class probabilities are what is actually regressed.)

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
(As I understand it, "pipeline" refers to the fixed sequence of stages that earlier architectures consisted of.)
("End-to-end" roughly means that one image goes in and the predictions come out directly.)

Paragraph 2 (fast; the error rate is somewhat higher, but background is less often mistaken for objects)

Our unified architecture is extremely fast.
Our base YOLO model processes images in real-time at 45 frames per second.

A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.

Finally, YOLO learns very general representations of objects.

It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

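0.2 Summary

1. The network looks at the image only once.
2. Because it looks only once, it is fast.
3. Because it looks only once, its overall error is somewhat higher (mostly localization error).
4. Although the overall error is higher, it makes fewer mistakes on background.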

1. Introduction

1.1 Original text with notes

Paragraph 1 (why the research matters: 1. it mirrors how humans see; 2. it can help solve many practical problems)

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact.
(One glance is enough for a person to know where things are and how they are moving or interacting.)

The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought.

Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Paragraph 2 (drawbacks of splitting detection into a proposal step and a classification step)

Current detection systems repurpose classifiers to perform detection.

To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image.

Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes.
(The sliding window approach scans the whole image; here only a selected set of candidate regions is scanned.)

After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13].

These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
(As I understand it, this resembles earlier multi-stage models such as PointNet, where the generation of an intermediate label has to be optimized along the way.)

Paragraph 3 (the advantage of YOLO)

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

Paragraph 4 (YOLO produces the result directly)

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

YOLO trains on full images and directly optimizes detection performance.

This unified model has several benefits over traditional methods of object detection.

Paragraph 5 (first: YOLO runs in real time and is far more accurate than other real-time systems)

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline.

We simply run our neural network on a new image at test time to predict detections.

Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps.

This means we can process streaming video in real-time with less than 25 milliseconds of latency.

Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems.

For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
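The latency figure follows from the frame rate by simple arithmetic. A minimal sketch (the function name `frame_time_ms` is made up for illustration, and only per-frame processing time is considered, not capture or display delays):

```python
def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Frame rates quoted in the paper for a Titan X GPU.
for name, fps in [("YOLO", 45), ("Fast YOLO", 150)]:
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame")
```

At 45 frames per second each frame takes about 22 ms, which is consistent with the "less than 25 milliseconds" stated above.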

Paragraph 6 (second: YOLO reasons about the whole image, so it makes fewer background errors)

Second, YOLO reasons globally about the image when making predictions.

Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.

Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context.

YOLO makes less than half the number of background errors compared to Fast R-CNN.

Paragraph 7 (third: good generalization)

Third, YOLO learns generalizable representations of objects.

When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin.

Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

Paragraph 8 (remaining weaknesses: lower overall accuracy, poor on small objects)

YOLO still lags behind state-of-the-art detection systems in accuracy.

While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.

We examine these tradeoffs further in our experiments.

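Paragraph 9

All of our training and testing code is open source. A variety of pretrained models are also available to download.

[Figure 1: image not included]
Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.
(Using a confidence score here is neat: the threshold is applied to the confidence directly, without having to argue about whether it is a proper probability.)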

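1.2 Summary

1. Real-time performance is valuable: processing images faster matters a lot in practice.
2. YOLO achieves real-time speed.
3. Because YOLO takes global context into account, it makes few background errors.
4. It generalizes well.
5. It does poorly on small objects (the authors partly address this in the loss function, see Section 2.2).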

2. Unified Detection

2.0 Overview

2.0.1 Original text with notes

Paragraph 1 (proposal and classification solved in one pass, global reasoning, end-to-end)

We unify the separate components of object detection into a single neural network.
(That is, the former proposal and classification stages are merged into one network.)

Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image.

The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.

Paragraph 2 (what is predicted)

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
(This is the core idea of YOLO: the image is divided into grid cells, and each cell is responsible for B bounding boxes. However, those B boxes share a single set of class probabilities, so all B boxes of a cell must belong to the same class. This is why YOLO does poorly on objects of different classes that lie close together.)

Paragraph 3 (how the confidence score is defined)

Each grid cell predicts B bounding boxes and confidence scores for those boxes.
(Why does this cell predict these boxes? Because the center of the object falls inside this cell.)

These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
(The score covers two things: whether there is an object at all, and how accurate the predicted box is. As the next sentence shows, the two are combined by multiplication.)

Formally we define confidence as Pr(Object) ∗ IOU^{truth}_{pred}.
(So the confidence is the product of two factors.)

Pr(Object): 1 if an object is present in the cell, 0 otherwise.
IOU: the familiar intersection over union between the predicted box and the ground truth box.

If no object exists in that cell, the confidence scores should be zero.
(Pr(Object) is simply 0 in that case.)

Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
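The target value for the confidence can be sketched in a few lines. This is an illustrative helper, not code from the paper: the box format (center x, center y, width, height) and the names `iou` and `target_confidence` are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def target_confidence(object_present, pred_box, truth_box):
    """Pr(Object) * IOU: zero for an empty cell, otherwise the IOU."""
    return iou(pred_box, truth_box) if object_present else 0.0
```

For example, two 2 × 2 boxes whose centers are one unit apart horizontally overlap in a 1 × 2 region, giving an IOU of 2 / 6.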

Paragraph 4 (what is predicted, continued)

Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
(The center of the bounding box is expressed relative to the grid cell that predicts it.)

The width and height are predicted relative to the whole image.
(Width and height are fractions of the full image size.)

Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
(In other words, when an object really is present, the confidence is simply the IOU.)
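To make the parametrization concrete, here is a small sketch that converts one cell-relative prediction back to pixel coordinates. The function name `decode_box`, the row/column indexing, and the 448 × 448 image size as a default are assumptions for illustration.

```python
def decode_box(row, col, x, y, w, h, S=7, img_w=448, img_h=448):
    """Convert a cell-relative prediction to a pixel-space (cx, cy, w, h).

    (x, y) are offsets inside grid cell (row, col), each in [0, 1];
    (w, h) are fractions of the full image width and height.
    """
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    return cx, cy, w * img_w, h * img_h


# A box centred in the middle cell of a 7 x 7 grid.
print(decode_box(row=3, col=3, x=0.5, y=0.5, w=0.25, h=0.5))
```

The center ends up in the middle of the image because cell (3, 3) is the central cell of a 7 × 7 grid.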

Paragraph 5 (the class part of the prediction)

Each grid cell also predicts C conditional class probabilities, Pr(Class_i|Object).
(This is the probability of each class given that the cell already contains an object.)

These probabilities are conditioned on the grid cell containing an object.

We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
(As mentioned earlier, B is the number of candidate bounding boxes per grid cell. So the B boxes of a cell must all be of the same class, which is why YOLOv1 is weak at predicting objects of different classes that are close to each other.)

Paragraph 6 (the scoring is logically consistent)

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} \qquad (1)$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
(The product directly yields the probability of the class; the authors presumably want to show that the design is mathematically consistent.)
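The multiplication at test time can be sketched as follows. The function name `class_scores` and the plain-list data layout are assumptions for illustration only.

```python
def class_scores(class_probs, box_confidences):
    """Class-specific confidence scores for every box of one grid cell.

    class_probs:      C conditional probabilities Pr(Class_i | Object)
    box_confidences:  B confidences Pr(Object) * IOU, one per box
    Returns a B x C nested list of scores.
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# Two classes, two boxes: the second box has zero confidence,
# so all of its class scores vanish.
print(class_scores([0.5, 0.25], [0.8, 0.0]))
```

Note that both boxes reuse the same class probabilities, which is exactly the one-class-per-cell restriction discussed in the notes.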

Paragraph 7 (the output format)

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20.
Our final prediction is a 7 × 7 × 30 tensor.
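The tensor size follows from S × S × (B ∗ 5 + C). A minimal sketch of that arithmetic is below; the ordering of the 30 values inside a cell (boxes first, then classes) is an assumption made for this example, since implementations differ.

```python
S, B, C = 7, 2, 20          # values used for PASCAL VOC in the paper

depth = B * 5 + C           # 5 numbers per box (x, y, w, h, confidence) + C classes
print((S, S, depth))        # output tensor shape
print(S * S * B)            # total number of predicted boxes per image


def split_cell(cell):
    """Split one cell's vector into B boxes and C class probabilities.

    Assumed layout: [box_0 (5 values), box_1 (5 values), ..., classes (C values)].
    """
    boxes = [cell[b * 5:(b + 1) * 5] for b in range(B)]
    class_probs = cell[B * 5:]
    return boxes, class_probs
```

With these values the depth is 30 and the network predicts 98 boxes per image.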

[Figure 2: image not included]

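2.0.2 Summary

1. Each grid cell is responsible for part of the detections, and the target of every output is clearly defined. What is unusual: a classification task normally ends in a softmax that yields one probability per class, whereas here the network also produces a confidence for each detection that measures whether the box is drawn correctly.
2. Everything is organized around the grid cell. Each cell produces confidences for B boxes but only one set of class probabilities (C values for C classes), so the B boxes of a cell share the same class and in effect a cell detects only one object.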

2.1. Network Design

2.1.1 Original text with notes

Paragraph 1 (the dataset and the overall model)

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9].

The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
(Before FCN appeared, fully connected layers were the usual way to produce the final output. They have a lot of learning capacity but also overfit easily.)

Paragraph 2 (the network in more detail)

Our network architecture is inspired by the GoogLeNet model for image classification [34].

Our network has 24 convolutional layers followed by 2 fully connected layers.

Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

(I find the 1 × 1 convolution quite effective. Here it reduces the number of channels, and the following 3 × 3 convolution expands it again. In the code you can see that 1 × 1 and 3 × 3 convolutions are applied alternately: 1 × 1, 3 × 3, 1 × 1, and so on.)
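One way to see why a 1 × 1 reduction layer helps is to count weights. The sketch below compares a direct 3 × 3 convolution with a 1 × 1 reduction followed by a 3 × 3 convolution; the channel counts (512 in, 256 reduced, 512 out) are illustrative assumptions, and biases are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias terms ignored)."""
    return c_in * c_out * k * k


# Direct 3 x 3 convolution, 512 -> 512 channels.
direct = conv_weights(512, 512, 3)

# 1 x 1 reduction to 256 channels, then 3 x 3 back up to 512 channels.
reduced = conv_weights(512, 256, 1) + conv_weights(256, 512, 3)

print(direct, reduced, round(reduced / direct, 3))
```

Under these assumptions the reduced pair needs a bit more than half the weights of the direct layer, while adding an extra nonlinearity in between.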

Paragraph 3 (a brief introduction to Fast YOLO)

We also train a fast version of YOLO designed to push the boundaries of fast object detection.
("Push the boundaries" here roughly means advancing the state of the field.)

Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers.
(Filters are the individual convolution kernels of a convolutional layer; the number of filters equals the number of output channels.)

Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Paragraph 4 (the output)

The final output of our network is the 7 × 7 × 30 tensor of predictions.

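2.1.2 Summary

This section describes the input of the network and the structure of its output. Roughly:

1. The network uses a GoogLeNet-like backbone (a product of its time) to extract a 7 × 7 feature map in which each position corresponds to one grid cell. (Note that each position of this feature map mainly carries information from the region around its own cell.)
2. The feature map is flattened and passed through fully connected layers, which mix in information from the whole image. (This is where global information really enters.)
3. The output is reshaped back to 7 × 7 × 30 so that the per-cell predictions can be read out.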

2.2. Training

Original text with notes

Paragraph 1 (how the authors set up training)

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30].
(Pretraining the lower layers helps the overall performance of the network.)

For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer.
(Shallow layers are hard to optimize sufficiently during the later detection training.)

We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24].
(That is, pretraining already reaches good accuracy.)

We use the Darknet framework for all training and inference [26].
(Darknet is a framework written earlier by the same author.)

Paragraph 2 (adding layers on top of the pretrained network and raising the resolution)

We then convert the model to perform detection.

Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29].

Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights.
("Randomly initialized" distinguishes these new layers from the pretrained ones.)

Detection often requires fine grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Paragraph 3 (normalization of the box parameters)

Our final layer predicts both class probabilities and bounding box coordinates.

We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.

We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

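Paragraph 4 (leaky ReLU activation)

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \qquad (2)$$

(In PyTorch this can be created with torch.nn.LeakyReLU(negative_slope=0.1). The default negative_slope is 0.01, so the slope has to be set explicitly to match the paper.)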
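The leaky rectified linear activation from the paper, φ(x) = x for x > 0 and 0.1x otherwise, is simple enough to write out directly. A minimal scalar sketch (the function name is my own):

```python
def leaky_relu(x: float, slope: float = 0.1) -> float:
    """Leaky ReLU with the slope of 0.1 used in the YOLO paper."""
    return x if x > 0 else slope * x


print([leaky_relu(v) for v in (-2.0, 0.0, 3.0)])
```

Negative inputs are scaled down rather than zeroed, so some gradient still flows for them.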

Paragraph 5 (one loss that covers both coordinate error and classification error)

We optimize for sum-squared error in the output of our model.
(The sum-squared error combines localization error, i.e. the coordinate error of the bounding boxes, and classification error. Coordinates and classes come out of the same forward pass, so both have to be evaluated together.)

We use sum squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision.

It weights localization error equally with classification error which may not be ideal.
(There are many class outputs per cell, so lumping everything together with equal weight is not appropriate.)
Also, in every image many grid cells do not contain any object.

This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects.
(Cells without objects are too numerous. If all cells are weighted equally, the empty cells get optimized well while the cells that actually predict objects are not optimized enough.)

This can lead to model instability, causing training to diverge early on.

Paragraph 6 (how the unequal weighting is implemented)

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.
("Coordinate predictions" means the predicted box coordinates themselves.)

We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.

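Paragraph 7 (balancing errors in large and small boxes)

Sum-squared error also equally weights errors in large boxes and small boxes.

Our error metric should reflect that small deviations in large boxes matter less than in small boxes.

To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

(A question worth thinking through: the width and height have already been normalized to lie between 0 and 1, so why do large and small boxes still need to be treated differently?

1. What exactly is normalized?
Normalization appears in two places. First, when the labels are prepared, the width and height of each ground-truth box are divided by the image width and height, so they lie between 0 and 1. Second, the network outputs live in the same normalized space and are converted back to pixels only when the final boxes are needed.
2. What does the loss use?
As written in the paper, the loss is computed on these normalized values, with √w and √h in place of w and h.
3. Why normalize at all, and why is the square root still needed?
Directly regressing values of very different magnitudes is hard; values between 0 and 1 are easier to predict. But dividing by the image size does not remove the difference between boxes: a large box still has w close to 1 and a small box has w close to 0, so the same deviation is a much larger relative error for the small box. The square root stretches the range near 0, so the same deviation in w costs more for a small box than for a large one.
)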
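A quick numeric check makes the effect of the square root visible. The box widths and the deviation below are made-up illustrative numbers, and `sqrt_width_error` is a hypothetical helper, not part of any YOLO implementation.

```python
import math


def sqrt_width_error(w_true: float, delta: float) -> float:
    """Squared error in sqrt-space when the width is off by `delta`."""
    return (math.sqrt(w_true + delta) - math.sqrt(w_true)) ** 2


delta = 0.05                            # same absolute deviation for both boxes
small = sqrt_width_error(0.05, delta)   # small box: 5% of the image width
large = sqrt_width_error(0.80, delta)   # large box: 80% of the image width

print(f"small box: {small:.6f}, large box: {large:.6f}")
```

With plain squared error both boxes would be penalized by exactly delta squared; in square-root space the small box is penalized roughly ten times more than the large one for the same deviation.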

Paragraph 8 (assigning one predictor to each object)

YOLO predicts multiple bounding boxes per grid cell.

At training time we only want one bounding box predictor to be responsible for each object.

We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.
(The idea is good, but it raises a practical question: the assignment depends on the current predictions, so it cannot be fixed in advance when the training labels are prepared. It has to be recomputed during training, in every forward pass, from the boxes the network currently predicts.)

This leads to specialization between the bounding box predictors.

Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
(Recall is the ability to find the objects that are actually present.)
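The assignment of the "responsible" predictor can be sketched as picking, among the B boxes of a cell, the one with the highest current IOU with the ground truth. The names `iou` and `responsible_predictor` and the (cx, cy, w, h) box format are assumptions for this illustration.

```python
def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
             - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
             - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth_box):
    """Index and IOU of the predicted box that best matches the ground truth."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    return best, ious[best]


truth = (0.5, 0.5, 0.4, 0.2)          # a wide ground-truth box
preds = [(0.5, 0.5, 0.2, 0.4),        # tall prediction
         (0.5, 0.5, 0.4, 0.25)]       # wide prediction
print(responsible_predictor(preds, truth))
```

Here the wide prediction wins, which hints at how predictors can specialize in certain aspect ratios over the course of training.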

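Paragraph 9 (the loss function)

During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\qquad (3)
$$

Here the indicator with index i denotes that an object appears in cell i, and the indicator with indices ij denotes that the j-th bounding box predictor in cell i is "responsible" for that prediction.

(More on this later.)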
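As a rough sketch of how the five parts of the loss fit together, here is the contribution of a single grid cell in plain Python. The data layout (tuples and a `target` dict), the function name `cell_loss`, and the assumption that predicted widths and heights are non-negative are all my own simplifications, not the authors' implementation.

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5   # weights given in the paper


def cell_loss(pred_boxes, pred_classes, target=None):
    """Sum-squared loss of one grid cell.

    pred_boxes:   B tuples (x, y, w, h, confidence), with w, h >= 0
    pred_classes: C predicted class probabilities
    target:       None for an empty cell, else a dict with
                  'box' (x, y, w, h), 'classes' (one-hot list),
                  'responsible' (index of the responsible predictor) and
                  'iou' (its current IOU with the ground truth).
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if target is not None and j == target["responsible"]:
            tx, ty, tw, th = target["box"]
            # Coordinate terms, only for the responsible predictor.
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            # Confidence should match the IOU with the ground truth.
            loss += (conf - target["iou"]) ** 2
        else:
            # Every other predictor should report zero confidence.
            loss += LAMBDA_NOOBJ * conf ** 2
    if target is not None:
        # Class term, only for cells that contain an object.
        loss += sum((p - t) ** 2 for p, t in zip(pred_classes, target["classes"]))
    return loss
```

The full loss would be the sum of `cell_loss` over all S × S cells. An empty cell only contributes through the down-weighted confidence term, which is how the many empty cells are kept from dominating the gradient.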

Paragraph 10 (each loss term penalizes one factor on its own)

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier).
(In other words, this term only covers errors in the conditional class probabilities. Whether an object is present at all is not handled here but by the confidence terms. The five terms of the loss are cleanly separated.)

It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

(So although these quantities are related, each is penalized by its own loss term and they do not interfere with each other.)

Paragraph 11 (practical training details)

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012.

When testing on 2012 we also include the VOC 2007 test data for training.

Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
(Worth noting down.)

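Paragraph 12 (the learning rate schedule)

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.
(Start with a small learning rate, raise it, then lower it again. Starting too high makes training suffer from the unstable gradients at the beginning.)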
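The schedule described in the paper (raise the rate from 10^-3 to 10^-2 over the first epochs, then 10^-2 for 75 epochs, 10^-3 for 30 epochs, and 10^-4 for 30 epochs) can be sketched as a step function. The paper does not say how long the warm-up lasts or whether it counts toward the 75 epochs, so `warmup_epochs=5`, the linear ramp, and counting the warm-up inside the first 75 epochs are assumptions.

```python
def learning_rate(epoch: int, warmup_epochs: int = 5) -> float:
    """Learning rate for a 0-based epoch index under the assumed schedule."""
    if epoch < warmup_epochs:
        # Linear ramp from 1e-3 towards 1e-2 (ramp shape is an assumption).
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    return 1e-4


for e in (0, 2, 10, 80, 120):
    print(e, learning_rate(e))
```

The three plateaus add up to the 135 epochs mentioned in the training details.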

Paragraph 13 (data augmentation)

To avoid overfitting we use dropout and extensive data augmentation.

A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers.

For data augmentation we introduce random scaling and translations of up to 20% of the original image size.
("Translations" here means shifts of the image. This also touches on an important idea in image recognition, translation invariance.)

We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
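A sketch of how such augmentation parameters might be sampled is shown below. The paper only gives the ranges (up to 20% scaling and translation, up to a factor of 1.5 for exposure and saturation); the sampling distributions, the choice to multiply or divide by the color factor, and all names here are assumptions.

```python
import random


def sample_augmentation(rng, jitter=0.2, max_factor=1.5):
    """Draw one set of augmentation parameters (hypothetical scheme)."""

    def color_factor():
        # Multiply or divide by a factor in [1, max_factor] with equal chance.
        f = rng.uniform(1.0, max_factor)
        return f if rng.random() < 0.5 else 1.0 / f

    return {
        "scale": 1.0 + rng.uniform(-jitter, jitter),  # relative resize
        "shift_x": rng.uniform(-jitter, jitter),      # fraction of image width
        "shift_y": rng.uniform(-jitter, jitter),      # fraction of image height
        "saturation": color_factor(),                 # applied to S in HSV
        "exposure": color_factor(),                   # applied to V in HSV
    }


print(sample_augmentation(random.Random(0)))
```

When the image is scaled or shifted, the ground-truth boxes have to be transformed with the same parameters so that labels and pixels stay aligned.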

2.3. Inference

2.3.1 Original text with notes

Paragraph 1 (testing, like training, needs only one pass)

Just like in training, predicting detections for a test image only requires one network evaluation.

On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

(Compare with the train and eval modes in PyTorch: the same network is used in both, only the mode changes.)

Paragraph 2 (non-maximal suppression)

The grid design enforces spatial diversity in the bounding box predictions.
(Whatever the image looks like, dividing it into a grid spreads the predictions over the whole image.)

Often it is clear which grid cell an object falls in to and the network only predicts one box for each object.

However, some large objects or objects near the border of multiple cells can be well localized by multiple cells.

Non-maximal suppression can be used to fix these multiple detections.
(Non-maximal suppression keeps the detection with the highest confidence and suppresses the others that overlap it. For a large object, several grid cells may each believe they can localize it, so the object is detected several times; only the detection with the highest confidence is kept.)

While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
(mAP stands for mean Average Precision.)
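Non-maximal suppression can be sketched as a greedy loop: go through the detections from highest to lowest score and keep one only if it does not overlap too much with anything already kept. The IOU threshold of 0.5, the (score, box) data layout, and the function names are assumptions; in practice the procedure is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
             - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
             - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximal suppression.

    detections: list of (score, (cx, cy, w, h)) for a single class.
    Returns the kept detections, highest score first.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep the detection only if it overlaps no stronger kept box too much.
        if all(iou(det[1], k[1]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


dets = [
    (0.9, (0.50, 0.5, 0.4, 0.4)),   # strong detection
    (0.6, (0.52, 0.5, 0.4, 0.4)),   # near-duplicate from a neighbouring cell
    (0.7, (0.10, 0.1, 0.1, 0.1)),   # separate object elsewhere
]
print(nms(dets))
```

The near-duplicate is dropped because it overlaps the stronger box almost completely, while the distant detection survives.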

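2.3.2 Summary

This section covers:
1. Testing is as simple as training: one network evaluation.
2. Non-maximal suppression: an object may be predicted by several boxes, so the boxes that do not have the highest confidence are suppressed.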

2.4. Limitations of YOLO

2.4.1 Original text with notes

Paragraph 1 (objects close to each other are hard for the model)

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class.
(In the code, each grid cell has B = 2 sets of (x, y, w, h, c), so in principle it could predict two boxes for two different objects. But there is only one set of 20 class probabilities, so the two objects would have to be of the same class. In practice a cell therefore predicts just one object.)

This spatial constraint limits the number of nearby objects that our model can predict.
(Objects that are close together may be missed.)

Our model struggles with small objects that appear in groups, such as flocks of birds.
("Struggles with" means the model has difficulty here. Small objects in a flock often fall into the same grid cell, and since a cell predicts only two boxes and one class, most of them are missed even though they all belong to the same class. Nearby objects of different classes are harder still.)

Paragraph 2 (unusual shapes generalize poorly, and the features are coarse)

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.

Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
(This is about the loss of resolution caused by downsampling. It is presumably unrelated to DeepLabv2, which was published later than YOLO.)

Paragraph 3 (large versus small bounding boxes)

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes.

A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU.

Our main source of error is incorrect localizations.

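2.4.2 Summary: limitations of the model

The limitations are:

1. Weak at predicting objects that are close to each other. This comes from the output structure: each grid cell can have B bounding boxes, but they must all be of one class, which is a strong restriction, so the extra boxes help only a little in practice.
2. Box shapes are learned from the data, so objects with unusual aspect ratios (for example in stretched images) are handled poorly.
3. Large and small bounding boxes are not equally sensitive to a deviation of the same size. The authors already compensate for this in the loss function; the limitation stated in the paper is that this compensation is only partial, so incorrect localization remains the main source of error.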

3.Comparison to Other Detection Systems

3.1 Original text with notes

Paragraph 1 (earlier methods all work in separate stages)

Object detection is a core problem in computer vision.
(Computer vision is what is usually abbreviated as CV.)

Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]).
(The works cited in parentheses all proceed this way.)

Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space.

These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39].

We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
(That is, YOLO is compared against them to see what is similar and what differs.)

Paragraph 2 (DPM)

Deformable parts models (DPM) use a sliding window approach to object detection [10].

DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc.
Our system replaces all of these disparate parts with a single convolutional neural network.
The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently.
Instead of static features, the network trains the features in-line and optimizes them for the detection task.
Our unified architecture leads to a faster, more accurate model than DPM.
(The remaining comparisons cover older work. I will come back to them later and first look at YOLO's own experiments.)

4. Experiments

4.0 Preface

4.0.1 Original text with notes

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007.

To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14].

Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost.

We also present VOC 2012 results and compare mAP to current state-of-the-art methods.

Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

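4.0.2 Summary

1. The experiments use the VOC 2007 and VOC 2012 datasets.
2. Since YOLO's accuracy is below the state of the art, the authors frame the result shrewdly: YOLO makes fewer errors on background, so it is still the more practical choice.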

4.1 Comparison to Other Real-Time Systems

Paragraph 1 (few truly real-time systems existed before; systems that are not real-time are also compared, to judge whether the speed gained is worth the mAP lost)

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28]
(Earlier work mostly accelerated existing models without breaking up their architecture.)

However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31].

We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz.
(The baselines are the two GPU versions of DPM, one running at 30 Hz and one at 100 Hz.)

While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Paragraph 2 (Fast YOLO shows how fast YOLO really is)

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector.

With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection.

YOLO pushes mAP to 63.4% while still maintaining real-time performance.

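Paragraph 3 (the VGG-16 variant is more accurate but slower)

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO.

It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.
("The paper" is singular here: it refers to this particular paper.)

(Looking at YOLOv1 code, the network is the same kind of stack of convolution, activation, and pooling layers as VGG-16, with normalization layers added in many reimplementations. The YOLO network actually has more convolutional layers, 24 versus 13 in VGG-16, yet it runs faster, presumably because the 1 × 1 reduction layers keep the computation per layer low.)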

Paragraph 4 (R-CNN minus R falls short)

R-CNN minus R replaces Selective Search with static bounding box proposals [20].

While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Paragraph 5 (Fast R-CNN is faster than R-CNN but still far from real time)

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals.

Thus it has high mAP but at 0.5 fps it is still far from realtime.

Paragraph 6 (Faster R-CNN: the VGG-16 version is more accurate than YOLO but much slower; the Zeiler-Fergus version is both slower and less accurate)

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]

In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps.

The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO.

The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

4.2. VOC 2007 Error Analysis

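Paragraph 1

4.6 Summary

The experiments fall into several parts:
1. Compared with recent work, YOLO may be somewhat less accurate, but it is fast.

5. Apology

Because of other urgent projects, this study is paused for now, so the reading of YOLOv1 is not complete.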
