YOLOv1(论文翻译)


You Only Look Once: Unified, Real-Time Object Detection

你只看一次:统一的,实时的目标检测

http://pjreddie.com/yolo/

Abstract

摘要
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
我们提出了一种新的目标检测方法:YOLO。先前的目标检测工作都是重新利用分类器来执行检测。相反,我们将目标检测构建为一个回归问题,直接回归空间上分离的边界框和相关的类别概率。单个神经网络在一次评估中就能直接从完整图像预测出边界框和类别概率。由于整个检测流程是一个单一的网络,因此可以直接针对检测性能进行端到端的优化。

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
我们的统一架构速度极快。我们的基本YOLO模型以每秒45帧的速度实时处理图像。更小版本的网络Fast YOLO可以达到惊人的每秒155帧,同时mAP仍能达到其他实时检测器的两倍。与最先进的检测系统相比,YOLO的定位误差更多,但更不容易把背景误报为目标(False Positive)。最后,YOLO能学习到泛化性很强的目标表示。当从自然图像推广到艺术作品等其他领域时,它优于包括DPM和R-CNN在内的其他检测方法。

1.Introduction

引言
Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
人类只需要看一眼图像,就会立刻知道图像中有什么物体,它们在哪里,以及它们是如何相互作用的。人类的视觉系统是快速而准确的,允许我们执行复杂的任务,如驾驶时很少产生有意识的想法。快速、准确的目标检测算法将允许计算机在没有专用传感器的情况下驾驶汽车,使辅助设备能够向人类用户传送实时的场景信息,并释放通用、响应迅速的机器人系统的潜力。

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].
当前的检测系统重新利用分类器来执行检测。为了检测某个目标,这些系统使用该目标的分类器,并在测试图像的不同位置和不同尺度上对其进行评估。像deformable parts models(DPM)这样的系统使用滑动窗口方法,让分类器在整幅图像上均匀间隔的位置运行[10]。

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
最近的一些方法,如R-CNN,使用候选区域(region proposal)方法,首先在图像中生成潜在的边界框,然后在这些候选框上运行分类器。分类之后,再用后处理来优化边界框、消除重复检测,并根据场景中的其他目标对这些框重新评分[13]。由于每个组件都必须单独训练,这些复杂的流程不仅速度慢,而且很难优化。

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
我们将目标检测重新构建为一个单一的回归问题,直接从图像像素得到边界框坐标和类别概率。使用我们的系统,你只需看一次(You Only Look Once,YOLO)图像,就能预测出图中有哪些目标以及它们在哪里。

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
YOLO非常简单:见图1。单个卷积网络同时预测多个边界框和这些边界框的类概率。YOLO在整个图像上训练,并直接优化检测性能。与传统的目标检测方法相比,这种统一的模型有许多优点。

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448× 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.
图1:YOLO检测系统。用YOLO处理图像简单而直接。我们的系统(1)将输入图像的大小调整为448×448,(2)在图像上运行单个卷积网络,(3)根据模型的置信度对检测结果进行阈值处理。
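图1的三个步骤可以用如下代码勾勒。这只是一个示意性草图,并非论文的实现:其中 `yolo_pipeline`、`network` 等名称均为本文为说明而虚构的,这里假定网络返回若干 (边界框, 置信度) 对。

```python
import numpy as np

def yolo_pipeline(image, network, conf_threshold=0.2):
    """图1流程的示意:缩放 -> 单次网络前向 -> 按置信度做阈值过滤。

    network 是对训练好的卷积网络的假设性占位,
    这里假定它返回 (边界框, 置信度) 对的列表。
    """
    # (1) 将输入图像缩放到 448 x 448(这里用最近邻采样以保持示例简短)。
    h, w = image.shape[:2]
    ys = np.arange(448) * h // 448
    xs = np.arange(448) * w // 448
    resized = image[ys[:, None], xs[None, :]]

    # (2) 在图像上运行单个卷积网络。
    detections = network(resized)

    # (3) 根据模型的置信度对检测结果进行阈值处理。
    return [(box, conf) for box, conf in detections if conf >= conf_threshold]

# 用一个返回固定结果的"假网络"演示整个流程。
dummy_net = lambda img: [((10, 10, 50, 50), 0.9), ((0, 0, 5, 5), 0.05)]
image = np.zeros((600, 800, 3), dtype=np.uint8)
kept = yolo_pipeline(image, dummy_net)
print(len(kept))  # 1:只有高置信度的框通过了阈值
```

整个流水线就是一次前向传播加一个阈值,没有候选区域生成等中间阶段,这正是后文速度优势的来源。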

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
首先,YOLO非常快。因为我们将检测看作回归问题,所以不需要复杂的流程。测试时我们只需要在新图像上运行我们的神经网络来预测检测结果。没有批处理时,在Titan X GPU上我们的基本网络能够以每秒45帧的速度运行,而快速版本的运行速度超过150 fps。这意味着我们可以在不到25毫秒的延迟时间内实时处理流媒体视频。此外,YOLO的mAP(mean average precision)是其他实时系统的两倍以上。有关我们的系统在网络摄像头上实时运行的演示,请参见我们的项目网页:http://pjreddie.com/yolo/。
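上面的"不到25毫秒延迟"可以由帧率直接换算得到:单帧延迟(毫秒)= 1000 / fps。

```python
# 按论文给出的吞吐量换算单帧延迟:延迟(毫秒)= 1000 / fps。
base_latency_ms = 1000 / 45    # 基本 YOLO 模型
fast_latency_ms = 1000 / 155   # Fast YOLO
print(round(base_latency_ms, 1))  # 22.2,低于 25 毫秒的实时预算
print(round(fast_latency_ms, 1))  # 6.5
```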

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.
其次,YOLO在做预测时会对图像进行全局推理。与基于滑动窗口和候选区域的方法不同,YOLO在训练和测试期间看到的是整幅图像,因此它隐式地编码了各类别的上下文信息及其外观。Fast R-CNN是一种顶级的检测方法[14],但由于看不到更大范围的上下文,它会把图像中的背景块误认为目标。与Fast R-CNN相比,YOLO产生的背景错误数量不到其一半。

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on art-work, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
第三,YOLO学习目标的可泛化表示。当在自然图像上训练并在艺术作品上测试时,YOLO大幅优于DPM和R-CNN等顶级检测方法。由于YOLO具有高度的泛化能力,当应用于新领域或遇到意外输入时,它不太容易失效。

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.
YOLO的准确度仍然落后于最先进的检测系统。虽然它可以快速识别图像中的目标,但它难以精确定位某些目标,尤其是小目标。我们在实验中进一步考察了这些权衡。

All of our training and testing code is open source. A variety of pretrained models are also available to download.
我们所有的训练和测试代码都是开源的。还可以下载各种预训练模型。

2.Unified Detection

统一检测
We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
我们将目标检测的单独部分统一到一个单一神经网络中。我们的网络使用整个图像的特征来预测每个边界框。它还可以同时预测一张图像中所有类别的所有边界框。这意味着我们的网络会对整个图像和图像中的所有目标进行全局性的分析。YOLO的设计可以实现端到端的训练和实时的速度,同时保持较高的平均精度。
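"同时预测所有类别的所有边界框"意味着网络的输出是一个固定形状的张量:S × S 个网格单元,每个单元预测 B 个边界框(每框 4 个坐标加 1 个置信度)和 C 个类别概率。下面用论文后文在 PASCAL VOC 上采用的设置 S=7、B=2、C=20 计算输出形状(这些具体数值来自论文后续章节,此处仅作说明):

```python
# 每个网格单元预测 B 个框(每框 4 个坐标 + 1 个置信度)和 C 个类别概率,
# 因此输出张量的形状为 S x S x (B * 5 + C)。
S, B, C = 7, 2, 20  # 论文在 PASCAL VOC 上使用的设置
output_shape = (S, S, B * 5 + C)
print(output_shape)  # (7, 7, 30)
```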

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
我们的系统将输入图像分割成 S × S 的网格。如果一个物体的中心落入某个网格单元,该网格单元就负责检测该物体。
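"物体中心落在哪个网格单元"可以通过简单的坐标换算确定。下面是一个示意性的辅助函数(`responsible_cell` 为本文虚构的名称,并非论文代码;默认 S=7 对应论文在 PASCAL VOC 上的设置):

```python
def responsible_cell(center_x, center_y, image_w, image_h, S=7):
    """返回负责检测该物体的网格单元 (行, 列)。

    物体中心 (center_x, center_y) 以像素坐标给出;
    min(..., S - 1) 把恰好落在右/下边界上的中心夹到最后一个单元。
    """
    col = min(int(center_x * S / image_w), S - 1)
    row = min(int(center_y * S / image_h), S - 1)
    return row, col

# 中心位于 448 x 448 图像正中的物体落在中央的单元。
print(responsible_cell(224, 224, 448, 448))  # (3, 3)
```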

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
每个网格单元预测 B 个边界框以及这些框的置信度分数。这些置信度分数反映了模型认为框内包含目标的把握程度,以及它认为自己预测的框有多准确。形式上,我们将置信度定义为 Pr(Object) ∗ IOU^truth_pred。如果该单元中不存在目标,置信度分数应为零;否则,我们希望置信度分数等于预测框与真实框之间的交并比(IOU)。
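论文将置信度定义为 Pr(Object) 与预测框、真实框之间 IOU 的乘积。下面用一个小例子计算这一目标值(`iou`、`confidence_target` 均为说明用的虚构函数,边界框采用 (x1, y1, x2, y2) 表示):

```python
def iou(box_a, box_b):
    """两个 (x1, y1, x2, y2) 形式边界框的交并比。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def confidence_target(pred_box, truth_box, object_present):
    # 单元内有目标时 Pr(Object) = 1,置信度目标等于 IOU;否则为 0。
    return iou(pred_box, truth_box) if object_present else 0.0

print(round(confidence_target((0, 0, 2, 2), (1, 1, 3, 3), True), 3))  # 0.143
print(confidence_target((0, 0, 2, 2), (1, 1, 3, 3), False))           # 0.0
```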
