You Only Look Once: Unified, Real-Time Object Detection (Reading Notes)

Because of the pandemic, my school added a new literature-reading course. Sigh, so now all I can do is read a paper and jot down a few notes at a time, and at the end I still have to write a 15,000-character literature review. Despair.

Today's paper: You Only Look Once: Unified, Real-Time Object Detection

Full text:

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

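The core idea in the abstract is that a single network maps a full image to spatially separated bounding boxes and class probabilities in one evaluation. As a minimal sketch of what that output looks like, here is an illustrative decoder using the paper's PASCAL VOC configuration (an S×S grid with S = 7, B = 2 boxes per cell, C = 20 classes); the function name and threshold below are my own illustration, not the authors' released code.

```python
import numpy as np

# YOLO's output per image is an S x S x (B*5 + C) tensor: each grid cell
# predicts B boxes (x, y, w, h, confidence) plus C conditional class probs.
S, B, C = 7, 2, 20

def decode(pred, score_thresh=0.2):
    """pred: (S, S, B*5 + C) array from one forward pass of the network."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]            # Pr(class_i | object)
            for b in range(B):
                x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
                # Class-specific confidence, as defined in the paper:
                # Pr(class_i | object) * Pr(object) * IOU
                scores = class_probs * conf
                cls = int(np.argmax(scores))
                if scores[cls] > score_thresh:
                    # x, y are offsets within the cell; w, h are relative
                    # to the whole image (all in [0, 1]).
                    cx, cy = (col + x) / S, (row + y) / S
                    detections.append((cx, cy, w, h, cls, float(scores[cls])))
    return detections

# One random "network output" just to exercise the decoder.
detections = decode(np.random.rand(S, S, B * 5 + C))
```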

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

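To make "a single regression problem, straight from image pixels to bounding box coordinates and class probabilities" concrete, here is a toy PyTorch stub of the input-to-tensor mapping. It is only a sketch under the paper's stated shapes: the real network has 24 convolutional layers followed by 2 fully connected layers and takes 448×448 input, while this stub just shows the one-pass mapping from pixels to the prediction tensor.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # PASCAL VOC settings from the paper

class TinyYOLO(nn.Module):
    """Illustrative stand-in for the 24-conv-layer network in the paper."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(7),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # Regress the whole detection tensor in one shot.
            nn.Linear(32 * 7 * 7, S * S * (B * 5 + C)),
        )

    def forward(self, x):                       # x: (N, 3, 448, 448)
        return self.head(self.backbone(x)).view(-1, S, S, B * 5 + C)

pred = TinyYOLO()(torch.zeros(1, 3, 448, 448))  # one evaluation per image
print(pred.shape)                               # torch.Size([1, 7, 7, 30])
```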

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

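The latency claim is simple arithmetic on the reported throughput, and easy to sanity-check:

```python
# Per-frame latency implied by the reported throughput (no batching).
base_latency_ms = 1000 / 45    # ~22.2 ms: under the 25 ms claimed above
fast_latency_ms = 1000 / 155   # ~6.5 ms for Fast YOLO (155 fps per abstract)
print(f"base: {base_latency_ms:.1f} ms, fast: {fast_latency_ms:.1f} ms")
```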

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.
