目标检测经典论文——YOLO论文翻译：You Only Look Once: Unified, Real-Time Object Detection（YOLO：统一的实时目标检测）

最新推荐文章于 2025-02-06 11:32:34 发布

bigcindy

最新推荐文章于 2025-02-06 11:32:34 发布

阅读量4.4k

点赞数 9

分类专栏：深度学习经典论文翻译文章标签： YOLO 实时目标检测 VOC MS COCO Fast YOLO

本文链接：https://blog.csdn.net/Jwenxue/article/details/107684001

版权

You Only Look Once: Unified, Real-Time Object Detection

YOLO：统一的实时目标检测

Joseph Redmon, Santosh Divvala†, Ross Girshick¶, Ali Farhadi*†

University of Washington*, Allen Institute for AI†, Facebook AI Research¶

http://pjreddie.com/yolo/

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

摘要

我们提出了YOLO，一种新的目标检测方法。以前的目标检测工作重复利用分类器来完成检测任务。相反，我们将目标检测框架看作回归问题，从空间上分割边界框和相关的类别概率。单个神经网络在一次评估中直接从整个图像上预测边界框和类别概率。由于整个检测流水线是单一网络，因此可以直接对检测性能进行端到端的优化。

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

我们的统一架构非常快。我们的基础YOLO模型以45帧/秒的速度实时处理图像。Fast YOLO是YOLO的一个较小版本，每秒能处理惊人的155帧图像，同时实现其它实时检测器两倍的mAP。与最先进的检测系统相比，YOLO虽然存在较多的定位错误，但很少将背景预测成假阳性（译者注：其它先进的目标检测算法将背景预测成目标的概率较大）。最后，YOLO能学习到目标非常通用的表示。当从自然图像到艺术品等其它领域泛化时，它都优于其它检测方法，包括DPM和R-CNN。

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

1. 引言

人们瞥一眼图像，立即知道图像中的物体是什么，它们在哪里以及它们如何相互作用。人类的视觉系统是快速和准确的，使我们能够执行复杂的任务，例如如驾驶车辆时不会刻意地进行思考或思想。快速、准确的目标检测算法可以让计算机在没有专用传感器的情况下驾驶汽车，使辅助设备能够向人类用户传达实时的场景信息，并具有解锁通用目的和响应机器人系统的潜力。

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

目前的检测系统重复利用分类器来执行检测。为了检测目标，这些系统为该目标提供一个分类器，并在不同的位置对其进行评估，并在测试图像中进行缩放。像可变形部件模型（DPM）这样的系统使用滑动窗口方法，其分类器在整个图像的均匀间隔的位置上运行[10]。

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

最近的许多方法，如R-CNN使用region proposal方法首先在图像中生成潜在的边界框，然后在这些提出的框上运行分类器。在分类之后，通过后处理对边界框进行修正，消除重复的检测，并根据场景中的其它目标重新定位边界框[13]。这些复杂的流程很慢，很难优化，因为每个单独的组件都必须单独进行训练。

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

我们将目标检测重构并看作为单一的回归问题，直接从图像像素到边界框坐标和类别概率。使用我们的系统，您只需要在图像上看一次（you only look once, YOLO），以预测出现的目标和位置。

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

YOLO新奇又很简单：如图1所示。单个卷积网络同时预测这些框的多个边界框和类别概率值。YOLO在全图像上训练并直接优化检测性能。这种统一的模型比传统的目标检测方法有一些好处。

图1：YOLO检测系统。用YOLO处理图像简单直接。我们的系统（1）将输入图像调整为448×448，（2）在图像上运行单个卷积网络，以及（3）由模型的置信度对所得到的检测进行阈值处理。

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

首先，YOLO速度非常快。由于我们将检测视为回归问题，所以我们不需要复杂的流程。测试时我们在一张新图像上简单的运行我们的神经网络来预测检测。我们的基础网络以每秒45帧的速度运行，在Titan X GPU上没有批处理，快速版本运行速度超过150fps。这意味着我们可以在不到25毫秒的延迟内实时处理流媒体视频。此外，YOLO实现了其它实时系统两倍以上的mAP。关于我们的系统在网络摄像头上实时运行的演示，请参阅我们的项目网页：http://pjreddie.com/yolo/。

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

其次，YOLO在进行预测时，会对图像进行全局地推理。与基于滑动窗口和region proposal的技术不同，YOLO在训练期间和测试时会看到整个图像，所以它隐式地编码了关于类的上下文信息以及它们的外形。Fast R-CNN是一种顶级的检测方法[14]，但因为它看不到更大的上下文，所以在图像中会将背景块误检为目标。与Fast R-CNN相比，YOLO的背景误检数量少了一半。

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

第三，YOLO学习目标的泛化表示。当在自然的图像上进行训练并对艺术作品进行测试时，YOLO大幅优于DPM和R-CNN等顶级检测方法。由于YOLO具有高度泛化能力，因此在应用于新领域或碰到非正常输入时很少出故障。

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

YOLO在准确度上仍然落后于最先进的检测系统。虽然它可以快速识别图像中的目标，但它仍在努力精确定位一些目标，尤其是一些小目标。我们在实验中会进一步检查这些权衡。

All of our training and testing code is open source. A variety of pretrained models are also available to download.

我们所有的训练和测试代码都是开源的。各种预训练模型也都可以下载。

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

2. 统一的检测

我们将目标检测的单独组件集成到单个神经网络中。我们的网络使用整个图像的特征来预测每个边界框。它还可以同时预测一张图像中的所有类别的所有边界框。这意味着我们的网络全面地推理整张图像和图像中的所有目标。YOLO设计可实现端到端训练和实时的速度，同时保持较高的平均精度。

Our system divides the input image into an S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

我们的系统将输入图像分成S×S的网格。如果一个目标的中心落入一个网格单元中，该网格单元负责检测该目标。

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object)∗IOUtruthpred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

每个网格单元预测这些盒子的B个边界框和置信度分数。这些置信度分数反映了该模型对盒子是否包含目标的置信度，以及它预测盒子的准确程度。在形式上，我们将置信度定义为Pr(Object)∗IOUtruthpred。如果该单元格中不存在目标，则置信度分数应为零。否则，我们希望置信度分数等于预测框与真实值之间联合部分的交集（IOU）。

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x,y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

每个边界框包含5个预测：x、y、w、h和置信度。(x，y)坐标表示边界框相对于网格单元边界框的中心。宽度和高度是相对于整张图像预测的。最后，置信度预测表示预测框与实际边界框之间的IOU。

Each grid cell also predicts C conditional class probabilities, Pr(Classi|Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes BB.

每个网格单元还预测C个条件类别概率Pr(Classi|Object)。这些概率以包含目标的网格单元为条件。每个网格单元我们只预测的一组类别概率，而不管边界框的的数量B是多少。

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

在测试时，我们乘以条件类概率和单个盒子的置信度预测值：

它为我们提供了每个框特定类别的置信度分数。这些分数编码了该类出现在框中的概率以及预测框拟合目标的程度。

For evaluating YOLO on Pascal VOC, we use S=7, B=2. Pascal VOC has 20 labelled classes so C=20. Our final prediction is a 7×7×30 tensor.

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S×S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S×S×(B∗5+C) tensor.

为了在Pascal VOC上评估YOLO，我们使用S=7，B=2。Pascal VOC有20个标注类，所以C=20。我们最终的预测是7×7×30的张量。