You Only Look Once: Unified, Real-Time Object Detection

The seminal work of the YOLO family. The translation and notes are my own; my proficiency is limited, so some passages may read awkwardly.

Paper link

Abstract

        We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.


        A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.


        Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.


        Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.


1. Introduction

        Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.


        Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image.


        More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.


        We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.


        YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.


        First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems.


        Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.


        Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.


        YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.


2. Unified Detection

        We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.


        Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.


        Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $Pr(Object)*IOU^{truth}_{pred}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

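To make the confidence target concrete, below is a minimal Python sketch, assuming corner-format boxes (x1, y1, x2, y2); the helper names are mine, not the paper's.

```python
def iou(a, b):
    """Intersection over union of two corner-format boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def confidence_target(pred_box, gt_box, object_present):
    # Pr(Object) * IOU: zero when the cell holds no object,
    # otherwise the IOU between the predicted box and the ground truth.
    return iou(pred_box, gt_box) if object_present else 0.0
```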

        Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

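A small sketch of this parametrization (a hypothetical helper, assuming pixel-space inputs): it finds the grid cell responsible for the object and produces the four normalized targets.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a ground-truth box (center cx, cy and size w, h in pixels)
    to YOLO's grid-relative targets."""
    col = min(S - 1, int(cx / img_w * S))   # grid cell containing the center
    row = min(S - 1, int(cy / img_h * S))
    x = cx / img_w * S - col                # center offset within the cell, in [0, 1]
    y = cy / img_h * S - row
    return row, col, (x, y, w / img_w, h / img_h)  # w, h relative to the image
```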

        Each grid cell also predicts C conditional class probabilities, $Pr(Class_i|Object)$.

        These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

Note: each grid cell always outputs B boxes, but only one predictor is made responsible for an object; as described in Section 2.2, it is the box with the highest IOU against the ground truth.

        At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i|Object)*Pr(Object)*IOU^{truth}_{pred} = Pr(Class_i)*IOU^{truth}_{pred}

 which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.


Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

Here the 5 corresponds to (x, y, w, h, confidence), C is the number of classes (it depends on the dataset), and B is the number of boxes per grid cell.

        For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

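A NumPy sketch of how the output tensor can be decoded and combined with the test-time product above (the memory layout is an assumption; the paper does not specify it):

```python
import numpy as np

def decode(pred, S=7, B=2, C=20):
    """Split the S x S x (B*5 + C) prediction into boxes and
    class-specific confidence scores."""
    pred = pred.reshape(S, S, B * 5 + C)
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # x, y, w, h, confidence
    class_probs = pred[..., B * 5:]                 # Pr(Class_i | Object)
    conf = boxes[..., 4]                            # Pr(Object) * IOU
    # class-specific confidence, shape (S, S, B, C)
    scores = conf[..., None] * class_probs[:, :, None, :]
    return boxes[..., :4], scores
```

On VOC this yields 7 x 7 x 2 = 98 boxes per image, matching Section 2.3.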

2.1. Network Design

        We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.


        Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.


        

        We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.


        The final output of our network is the 7 × 7 × 30 tensor of predictions.


2.2. Training

        We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].


        We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.


        Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.


        We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}

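As a direct NumPy transcription of the formula above:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Leaky rectified linear activation: x where x > 0, else slope * x."""
    return np.where(x > 0, x, slope * x)
```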

        We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

Note: cells containing no object far outnumber those that do, so their accumulated confidence gradient can drown out the gradient from the cells that contain objects.

        To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.


        Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

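A quick numeric illustration (the numbers are mine, not from the paper): a 10-pixel width error costs the same for any box in raw space, but in square-root space the small box is penalized almost nine times more.

```python
import math

big_err   = (math.sqrt(310) - math.sqrt(300)) ** 2   # ~0.082 for a 300-px box
small_err = (math.sqrt(40)  - math.sqrt(30))  ** 2   # ~0.718 for a 30-px box
print(big_err, small_err)   # in raw space both errors would be 10**2 = 100
```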

        YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

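The assignment rule as a tiny sketch, given the IOU of each of the cell's B predicted boxes against the ground-truth box:

```python
def responsible_predictor(ious):
    """Index of the predictor whose box has the highest IOU with the
    ground truth; `ious` holds one value per predicted box in the cell."""
    return max(range(len(ious)), key=lambda j: ious[j])
```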

During training we optimize the following, multi-part loss function:

\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]\\+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]\\+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2\\+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2\\+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in \text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2

where $\mathbb{1}_{i}^{obj}$ denotes if an object appears in cell $i$ and $\mathbb{1}_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.


        Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

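Below is a minimal NumPy sketch of this loss, assuming the matching step has already produced the indicator masks; the array layout is hypothetical and this is not the authors' Darknet implementation.

```python
import numpy as np

def yolo_loss(pred, target, obj_ij, obj_i, lambda_coord=5.0, lambda_noobj=0.5):
    """pred/target: dicts with "boxes" (S*S, B, 5) holding x, y, w, h,
    confidence, and "probs" (S*S, C) holding class probabilities.
    obj_ij: (S*S, B) mask, 1 where predictor j in cell i is responsible.
    obj_i:  (S*S,) mask, 1 where an object appears in cell i."""
    noobj_ij = 1.0 - obj_ij                      # complement of obj_ij
    pb, tb = pred["boxes"], target["boxes"]
    xy_err = ((pb[..., :2] - tb[..., :2]) ** 2).sum(-1)
    wh_err = ((np.sqrt(pb[..., 2:4]) - np.sqrt(tb[..., 2:4])) ** 2).sum(-1)
    conf_err = (pb[..., 4] - tb[..., 4]) ** 2
    cls_err = ((pred["probs"] - target["probs"]) ** 2).sum(-1)
    return (lambda_coord * (obj_ij * (xy_err + wh_err)).sum()
            + (obj_ij * conf_err).sum()
            + lambda_noobj * (noobj_ij * conf_err).sum()
            + (obj_i * cls_err).sum())
```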

        We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.


        Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.

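The schedule as a small Python function; the paper does not give the exact warm-up length, so the `warmup` value below is an assumption.

```python
def learning_rate(epoch, warmup=5):
    """Piecewise learning-rate schedule over ~135 epochs."""
    if epoch < warmup:                             # slowly raise 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup
    if epoch < 75:
        return 1e-2                                # ~75 epochs at 1e-2
    if epoch < 105:
        return 1e-3                                # then 30 epochs at 1e-3
    return 1e-4                                    # final 30 epochs at 1e-4
```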

        To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

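A sketch of how these augmentation parameters could be sampled (my own helper; Darknet's actual implementation may differ):

```python
import random

def sample_augmentation(img_w, img_h, jitter=0.2):
    """Random scaling/translation up to 20% of the image size, and exposure/
    saturation factors up to 1.5x (or 1/1.5x) for the HSV adjustment."""
    scale = 1.0 + random.uniform(-jitter, jitter)
    dx = random.uniform(-jitter, jitter) * img_w
    dy = random.uniform(-jitter, jitter) * img_h
    exposure = random.uniform(1.0, 1.5) ** random.choice([-1, 1])
    saturation = random.uniform(1.0, 1.5) ** random.choice([-1, 1])
    return scale, dx, dy, exposure, saturation
```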

2.3. Inference

        Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.


        The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

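A sketch of the greedy non-maximal suppression typically applied to YOLO's 98 boxes; the algorithm is standard, but the IOU threshold below is a common default rather than a value from the paper (the `iou` helper is the corner-format one sketched in Section 2).

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat.
    boxes: list of (x1, y1, x2, y2); scores: class-specific confidences."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```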

2.4. Limitations of YOLO

        YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.


        Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.


        Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

