YOLOv1 Translation (You Only Look Once: Unified, Real-Time Object Detection)


You Only Look Once: Unified, Real-Time Object Detection

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.


Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.


1 Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.


Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].


More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.


We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.


YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.


[Figure 1: The YOLO detection system. (1) Resize the input image to 448 × 448, (2) run a single convolutional network on the image, and (3) threshold the resulting detections by the model's confidence.]

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.


Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.


Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.


YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.


All of our training and testing code is open source. A variety of pretrained models are also available to download.


2 Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.


Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.


Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $\Pr(\text{Object}) \ast \mathrm{IOU}^{\text{truth}}_{\text{pred}}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

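To make the confidence target concrete, here is a minimal sketch (our own illustration, not the authors' released code) of the IOU computation and the resulting confidence target; the corner-style box format and function names are assumptions:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(pred_box, gt_box, cell_has_object):
    """Pr(Object) * IOU: zero for cells without objects, IOU otherwise."""
    return iou(pred_box, gt_box) if cell_has_object else 0.0
```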

Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence. The $(x, y)$ coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.


Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B. At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) \ast \Pr(\text{Object}) \ast \mathrm{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \ast \mathrm{IOU}^{\text{truth}}_{\text{pred}}$$
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

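The test-time product can be written as a single broadcasted multiplication. A sketch under the S × S × B × C layout described above (illustrative, not the released implementation):

```python
import numpy as np

def class_specific_scores(cond_class_probs, box_confidences):
    """Class-specific confidence scores for every box.

    cond_class_probs: (S, S, C) array of Pr(Class_i | Object) per grid cell.
    box_confidences:  (S, S, B) array of Pr(Object) * IOU per predicted box.
    Returns an (S, S, B, C) array of Pr(Class_i) * IOU scores.
    """
    return box_confidences[..., :, None] * cond_class_probs[..., None, :]
```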

Figure 2: The model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B × 5 + C) tensor.

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

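As a quick sanity check on these dimensions (hypothetical snippet):

```python
S, B, C = 7, 2, 20        # grid size, boxes per cell, PASCAL VOC classes
depth = B * 5 + C         # each box contributes x, y, w, h, confidence
print((S, S, depth))      # (7, 7, 30): the final prediction tensor
```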

2.1 Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset[9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.


Our network architecture is inspired by the GoogLeNet model for image classification [33]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.

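The reduction-then-convolution pattern can be sketched in PyTorch as below. This is our own illustrative module, not the original Darknet code; the leaky ReLU slope of 0.1 follows the activation given in Section 2.2:

```python
import torch.nn as nn

def reduce_then_conv(in_ch, mid_ch, out_ch):
    """A 1x1 reduction layer followed by a 3x3 convolution, used in place
    of inception modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.1),
    )
```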

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.


[Figure 3: The architecture. The detection network has 24 convolutional layers followed by 2 fully connected layers; alternating 1 × 1 convolutional layers reduce the feature space from preceding layers.]

The final output of our network is the 7 × 7 × 30 tensor of predictions.


2.2 Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [29]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24].


We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [28]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.


Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

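A sketch of this target encoding, assuming ground-truth boxes given as pixel-space centers and sizes (the function name and box format are our own):

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Encode a ground-truth box as YOLO regression targets.

    Returns (row, col, x, y, w, h): the responsible grid cell plus
    normalized targets, all bounded between 0 and 1.
    """
    col = min(S - 1, int(cx / img_w * S))   # cell containing the box center
    row = min(S - 1, int(cy / img_h * S))
    x = cx / img_w * S - col                # center offset within the cell
    y = cy / img_h * S - row
    w = bw / img_w                          # size relative to the whole image
    h = bh / img_h
    return row, col, x, y, w, h
```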

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.


To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$, to accomplish this. We set $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$.


Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.


YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.


During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

where $\mathbb{1}_i^{\text{obj}}$ denotes if object appears in cell $i$ and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the $j$th bounding box predictor in cell $i$ is "responsible" for that prediction.


Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

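Putting the pieces together, a NumPy sketch of the multi-part loss might look like the following. It treats every non-responsible box as "no object" (one common reading of $\mathbb{1}_{ij}^{\text{noobj}}$) and assumes non-negative widths and heights; it is illustrative, not the released implementation:

```python
import numpy as np

def yolo_loss(pred_boxes, true_boxes, pred_conf, true_conf,
              pred_cls, true_cls, resp, obj,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared multi-part loss.

    pred_boxes, true_boxes: (S, S, B, 4) arrays of (x, y, w, h).
    pred_conf, true_conf:   (S, S, B) confidence scores.
    pred_cls, true_cls:     (S, S, C) class probabilities.
    resp: (S, S, B) 0/1 mask of "responsible" predictors (1_ij^obj).
    obj:  (S, S) 0/1 mask of cells containing an object (1_i^obj).
    """
    xy_err = ((pred_boxes[..., :2] - true_boxes[..., :2]) ** 2).sum(-1)
    wh_err = ((np.sqrt(pred_boxes[..., 2:]) -
               np.sqrt(true_boxes[..., 2:])) ** 2).sum(-1)
    coord = lambda_coord * (resp * (xy_err + wh_err)).sum()

    conf_err = (pred_conf - true_conf) ** 2
    conf = (resp * conf_err).sum() + lambda_noobj * ((1 - resp) * conf_err).sum()

    cls = (obj * ((pred_cls - true_cls) ** 2).sum(-1)).sum()
    return coord + conf + cls
```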

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.


Our learning rate schedule is as follows: for the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.

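As a sketch, the schedule reduces to a piecewise function of the epoch. The warmup length is our assumption, since the text only says "the first epochs":

```python
def learning_rate(epoch, warmup_epochs=5):
    """Piecewise learning rate schedule (warmup length assumed)."""
    if epoch < warmup_epochs:                # slow ramp from 1e-3 to 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:           # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 105:          # then 30 epochs at 1e-3
        return 1e-3
    return 1e-4                              # final 30 epochs at 1e-4
```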

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

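A sketch of how these augmentation parameters might be sampled; the paper gives only the ranges, so the exact distributions below are assumptions:

```python
import random

def sample_augmentation(img_w, img_h):
    """Random scaling/translation up to 20% of image size, plus HSV
    exposure and saturation scaling by up to a factor of 1.5."""
    scale = random.uniform(0.8, 1.2)
    tx = random.uniform(-0.2, 0.2) * img_w
    ty = random.uniform(-0.2, 0.2) * img_h
    exposure = random.uniform(1.0, 1.5) ** random.choice([-1, 1])
    saturation = random.uniform(1.0, 1.5) ** random.choice([-1, 1])
    return scale, tx, ty, exposure, saturation
```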

2.3 Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.


The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

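A sketch of greedy non-maximal suppression, reusing the `iou` helper from the sketch in Section 2; the 0.5 overlap threshold is an assumption:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximal suppression.

    boxes: list of (x1, y1, x2, y2); scores: matching confidences.
    Returns the indices of the boxes that survive suppression.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_thresh]
    return keep
```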

2.4 Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.


Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.


Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.


3 Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers[35, 21, 13, 10] or localizers [1, 31] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [34, 15, 38]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.


Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection[10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.


R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [34] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].


YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.


Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [27]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.


Many research efforts focus on speeding up the DPM pipeline [30] [37] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [30] actually runs in real-time.


Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.


Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [36]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.


Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, Multi-Box cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.


OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [31]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. Over-Feat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.


MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al. [26]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.


4 Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.


4.1 Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [37] [30] [14] [17] [27] However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [30]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.


Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.


We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.


Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [37]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.


R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.


Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.


The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.


[Table 1: Real-time systems on PASCAL VOC 2007, comparing mAP and speed (FPS).]

4.2 VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.


We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error (see the sketch after this list):


  • Correct: correct class and IOU > .5

  • Localization: correct class, .1 < IOU < .5

  • Similar: class is similar, IOU > .1

  • Other: class is wrong, IOU > .1

  • Background: IOU < .1 for any object

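A sketch of this bucketing; the mapping from a class to its set of similar classes is dataset-specific and assumed here:

```python
def classify_prediction(pred_class, true_class, iou_val, similar_classes):
    """Bucket one detection per the error taxonomy of Hoiem et al."""
    if pred_class == true_class and iou_val > 0.5:
        return "Correct"
    if pred_class == true_class and 0.1 < iou_val < 0.5:
        return "Localization"
    if true_class in similar_classes.get(pred_class, set()) and iou_val > 0.1:
        return "Similar"
    if iou_val > 0.1:
        return "Other"
    return "Background"    # IOU < .1 for any object
```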

Figure 4 shows the breakdown of each error type averaged across all 20 classes.


YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.


4.3 Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.

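A sketch of the rescoring scheme. The paper does not spell out the boosting formula, so the one below is only a plausible form; it reuses the earlier `iou` helper:

```python
def boost_rcnn_scores(rcnn_dets, yolo_dets, iou_thresh=0.5):
    """For each Fast R-CNN box, boost its score when YOLO predicts a
    similar box, using YOLO's probability and the overlap (assumed form).

    Each detection is a dict like {"box": (x1, y1, x2, y2), "score": float}.
    """
    for det in rcnn_dets:
        for ydet in yolo_dets:
            overlap = iou(det["box"], ydet["box"])
            if overlap > iou_thresh:
                det["score"] += ydet["score"] * overlap
    return rcnn_dets
```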

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.


The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.


Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.


4.4 VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.


Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.


4.5 Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.


Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.


R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.


DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.


YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.


[Figure 5: Generalization results. Comparative performance of YOLO and other detection methods on the Picasso and People-Art person-detection datasets.]

5 Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.


The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.


6 Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.


Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.


Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.

