YOLOv2 Paper: Translation and Commentary

Latest recommended article published on 2023-11-27 14:20:09

肖飒风


YOLO9000: Better, Faster, Stronger

Paper link; code link

Abstract:

We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don’t have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.

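Paraphrase: YOLO9000 is a state-of-the-art, real-time object detection system that covers more than 9000 object categories. The authors first improve the YOLO detection method with both new ideas and techniques drawn from earlier work. The resulting model, YOLOv2, is state-of-the-art on standard detection tasks such as PASCAL VOC and COCO. Thanks to a multi-scale training method, the same YOLOv2 model can run at different input sizes, which gives an easy trade-off between speed and accuracy: 76.8 mAP on VOC 2007 at 67 FPS, and 78.6 mAP at 40 FPS. The latter outperforms Faster R-CNN with ResNet and SSD while still running significantly faster than both. The authors then propose a method for joint training on object detection and classification, and use it to train YOLO9000 on the COCO detection dataset and the ImageNet classification dataset at the same time. Joint training lets YOLO9000 predict detections for object classes that have no labelled detection data. The approach is validated on the ImageNet detection task: YOLO9000 reaches 19.7 mAP on the ImageNet detection validation set although it has detection data for only 44 of the 200 classes, and 16.0 mAP on the 156 classes that are not in COCO. YOLO9000 is not limited to those 200 classes; it predicts detections for more than 9000 object categories, and it still runs in real time.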

1. Introduction

General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects.
Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags. Classification datasets have millions of images with tens or hundreds of thousands of categories.
We would like detection to scale to level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free). Thus we are unlikely to see detection datasets on the same scale as classification datasets in the near future.
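Paraphrase: A general-purpose object detector should be fast, accurate, and able to recognize a wide variety of objects. Since neural networks were introduced, detection frameworks have become steadily faster and more accurate, yet most detection methods are still limited to a small set of objects.

Current object detection datasets are small compared with datasets for other tasks such as classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags, whereas classification datasets contain millions of images with tens or hundreds of thousands of categories.

The authors want detection to scale to the level of object classification. Labelling images for detection, however, costs far more than labelling for classification or tagging (tags are often supplied by users for free), so detection datasets on the same scale as classification datasets are unlikely to appear in the near future.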

We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together.
We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness.
Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO.
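Paraphrase: The paper proposes a new method that harnesses the large amount of classification data already available and uses it to expand the scope of current detection systems. The method relies on a hierarchical view of object classification, which makes it possible to combine distinct datasets.

It also proposes a joint training algorithm that trains object detectors on both detection and classification data. Labelled detection images teach the model to localize objects precisely, while classification images increase its vocabulary and robustness.

With this method the authors train YOLO9000, a real-time object detector that can detect more than 9000 object categories. They first improve the base YOLO detection system to obtain YOLOv2, a state-of-the-art, real-time detector, and then use the dataset combination method and the joint training algorithm to train a model on more than 9000 classes from ImageNet together with detection data from COCO.

All of the code and pre-trained models are available online at http://pjreddie.com/yolo9000/.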

2. Better

YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy.
Computer vision generally trends towards larger, deeper networks. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a more accurate detector that is still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We pool a variety of ideas from past work with our own novel concepts to improve YOLO’s performance. A summary of results can be found in Table 2.
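Paraphrase: Compared with state-of-the-art detection systems, YOLO has a number of shortcomings. An error analysis of YOLO against Fast R-CNN shows that YOLO makes a significant number of localization errors, and its recall is relatively low compared with region proposal-based methods. The work therefore concentrates on improving recall and localization while maintaining classification accuracy.

Computer vision generally trends towards larger, deeper networks, and better performance often comes from training larger networks or ensembling several models. For YOLOv2 the goal is a more accurate detector that is still fast, so instead of scaling the network up the authors simplify it and then make the representation easier to learn. They combine a variety of ideas from past work with their own new concepts to improve YOLO's performance. Table 2 summarizes the results.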

Batch Normalization. Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization. By adding batch normalization on all of the convolutional layers in YOLO we get more than 2% improvement in mAP. Batch normalization also helps regularize the model. With batch normalization we can remove dropout from the model without overfitting.

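Batch normalization. Batch normalization significantly improves convergence and removes the need for other forms of regularization. Adding it to all of the convolutional layers in YOLO improves mAP by more than 2%. It also helps regularize the model, so dropout can be removed from the model without overfitting.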

High Resolution Classifier. All state-of-the-art detection methods use classifier pre-trained on ImageNet. Starting with AlexNet most classifiers operate on input images smaller than 256 × 256. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.
For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP.

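High-resolution classifier. All state-of-the-art detection methods use a classifier pre-trained on ImageNet. Starting with AlexNet, most classifiers take input images smaller than 256 × 256. The original YOLO trains the classifier network at 224 × 224 and raises the resolution to 448 for detection, so the network has to switch to learning object detection and adapt to the new input resolution at the same time.

For YOLOv2 the classification network is first fine-tuned on ImageNet at the full 448 × 448 resolution for 10 epochs, which gives the network time to adjust its filters to higher-resolution input. The resulting network is then fine-tuned on the detection task. This high-resolution classification network raises mAP by almost 4%.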

Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly Faster R-CNN predicts bounding boxes using hand-picked priors. Using only convolutional layers the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in a feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.
We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First we eliminate one pooling layer to make the output of the network’s convolutional layers higher resolution. We also shrink the network to operate on 416 input images instead of 448×448. We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image so it’s good to have a single location right at the center to predict these objects instead of four locations that are all nearby. YOLO’s convolutional layers downsample the image by a factor of 32 so by using an input image of 416 we get an output feature map of 13 × 13.
When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box and the class predictions predict the conditional probability of that class given that there is an object.
Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.

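Convolutional with anchor boxes. YOLO predicts bounding box coordinates directly, using fully connected layers on top of the convolutional feature extractor. Faster R-CNN does not predict coordinates directly; it predicts bounding boxes using hand-picked priors. Its region proposal network (RPN) uses only convolutional layers to predict offsets and confidences for anchor boxes. Because the prediction layer is convolutional, the RPN predicts these offsets at every location of a feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.

YOLOv2 removes the fully connected layers from YOLO and predicts bounding boxes with anchor boxes. One pooling layer is eliminated so that the output of the network's convolutional layers has higher resolution. The network is also shrunk to operate on 416 × 416 input images instead of 448 × 448, because an odd number of locations in the feature map gives a single center cell. Objects, especially large ones, tend to occupy the center of the image, so it is better to predict them from one location right at the center than from four nearby locations. YOLO's convolutional layers downsample the image by a factor of 32, so a 416 input yields a 13 × 13 output feature map.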
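The arithmetic behind the choice of 416 over 448 can be checked in a few lines. This is a minimal sketch, assuming a square input and the total downsampling factor of 32 stated above; the function names are hypothetical and not taken from the paper or from Darknet.

```python
def output_grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output feature map for a square input image."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


def has_single_center_cell(grid_size: int) -> bool:
    """An odd grid has exactly one cell at the center of the image."""
    return grid_size % 2 == 1


for size in (448, 416):
    grid = output_grid_size(size)
    print(size, grid, has_single_center_cell(grid))
```

A 448 input gives a 14 × 14 grid with no single center cell, while 416 gives the 13 × 13 grid described in the text.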

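With anchor boxes, the class prediction mechanism is also decoupled from the spatial location: class and objectness are predicted for every anchor box. As in the original YOLO, the objectness prediction still predicts the IOU between the ground truth box and the proposed box, and the class predictions predict the conditional probability of each class given that an object is present.

Using anchor boxes causes a small drop in accuracy. The original YOLO predicts only 98 boxes per image, whereas the model with anchor boxes predicts more than a thousand. Without anchor boxes the intermediate model gets 69.5 mAP with a recall of 81%; with anchor boxes it gets 69.2 mAP with a recall of 88%. Although mAP decreases, the higher recall means the model has more room to improve.

Figure 2: Clustering box dimensions on VOC and COCO. k-means clustering is run on the dimensions of bounding boxes to obtain good priors for the model. The left plot shows the average IOU obtained for various choices of k; k = 5 gives a good trade-off between recall and model complexity. The right plot shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes, and COCO has greater variation in size than VOC.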

Dimension Clusters. We encounter two issues with anchor boxes when using them with YOLO. The first is that the box dimensions are hand picked. The network can learn to adjust the boxes appropriately but if we pick better priors for the network to start with we can make it easier for the network to learn to predict good detections.
Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus for our distance metric we use: d(box, centroid) = 1 - IOU(box, centroid)
We run k-means for various values of k and plot the average IOU with closest centroid, see Figure 2. We choose k = 5 as a good tradeoff between model complexity and high recall. The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes.
We compare the average IOU to closest prior of our clustering strategy and the hand-picked anchor boxes in Table 1. At only 5 priors the centroids perform similarly to 9 anchor boxes with an average IOU of 61.0 compared to 60.9. If we use 9 centroids we see a much higher average IOU. This indicates that using k-means to generate our bounding box starts the model off with a better representation and makes the task easier to learn.

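**Dimension clusters.** Using anchor boxes with YOLO raises two issues. The first is that the box dimensions are hand-picked. The network can learn to adjust the boxes appropriately, but starting from better priors makes it easier for the network to learn good detections.

Instead of choosing priors by hand, k-means clustering is run on the training set bounding boxes to find good priors automatically. With standard k-means and Euclidean distance, larger boxes generate more error than smaller boxes. What is actually wanted are priors that lead to good IOU scores regardless of box size, so the distance metric is:

d(box, centroid) = 1 - IOU(box, centroid)

k-means is run for various values of k and the average IOU with the closest centroid is plotted (see Figure 2). k = 5 is chosen as a good trade-off between model complexity and high recall. The cluster centroids differ significantly from hand-picked anchor boxes: there are fewer short, wide boxes and more tall, thin boxes.

Table 1 compares the average IOU to the closest prior for the clustering strategy and for hand-picked anchor boxes. With only 5 priors the centroids perform similarly to 9 anchor boxes (average IOU of 61.0 versus 60.9), and 9 centroids give a much higher average IOU. Generating the priors with k-means therefore starts the model off with a better representation and makes the task easier to learn.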
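The clustering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: boxes are reduced to (width, height) pairs aligned at a common corner so that only their dimensions matter, centroids are updated with the per-cluster mean (the paper does not spell out the update rule), and the names `iou_wh`, `kmeans_iou`, and `average_iou` are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IOU of two boxes given as (width, height), aligned at a shared corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_iou(boxes, k, iterations=100, seed=0):
    """Cluster (w, h) pairs with d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # The smallest distance 1 - IOU is the largest IOU.
            nearest = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[nearest].append(box)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if not members:  # keep the centroid of an empty cluster unchanged
                new_centroids.append(old)
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def average_iou(boxes, centroids):
    """Mean IOU between each box and its closest centroid."""
    return sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)


# Toy data: two tall boxes and two wide boxes.
toy_boxes = [(1.0, 2.0), (1.1, 2.1), (4.0, 1.0), (4.2, 1.1)]
priors = kmeans_iou(toy_boxes, k=2)
print(priors, average_iou(toy_boxes, priors))
```

Plotting `average_iou` against k for real training boxes would give a curve of the kind shown in the left panel of Figure 2.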

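Table 1: Average IOU of boxes to closest priors on VOC 2007. The average IOU of objects on VOC 2007 to their closest, unmodified prior using different generation methods. Clustering gives much better results than using hand-picked priors.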

Direct location prediction. When using anchor boxes with YOLO we encounter a second issue: model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) locations for the box. In region proposal networks the network predicts values tx and ty and the (x, y) center coordinates are calculated as:
$$x = (t_x * w_a) - x_a, \qquad y = (t_y * h_a) - y_a$$
For example, a prediction of $t_x = 1$ would shift the box to the right by the width of the anchor box, a prediction of $t_x = -1$ would shift it to the left by the same amount.
This formulation is unconstrained so any anchor box can end up at any point in the image, regardless of what location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets.
Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network’s predictions to fall in this range.
The network predicts 5 bounding boxes at each cell in the output feature map. The network predicts 5 coordinates for each bounding box, tx, ty, tw, th, and to. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:
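$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
$$Pr(\text{object}) * IOU(b, \text{object}) = \sigma(t_o)$$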

Since we constrain the location prediction the parametrization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.

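Direct location prediction. Using anchor boxes with YOLO raises a second issue: model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) location of the box. In region proposal networks the network predicts values t_x and t_y, and the paper computes the (x, y) center coordinates as:

$$x = (t_x * w_a) - x_a, \qquad y = (t_y * h_a) - y_a$$

This formula contains an error: the author appears to have written a minus sign where a plus sign belongs. The reasoning is as follows. The anchor parameterization comes from Faster R-CNN, which defines it as:

$$t_x = (x - x_a) / w_a, \qquad t_y = (y - y_a) / h_a$$
$$t_w = \log(w / w_a), \qquad t_h = \log(h / h_a)$$
$$t_x^* = (x^* - x_a) / w_a, \qquad t_y^* = (y^* - y_a) / h_a$$
$$t_w^* = \log(w^* / w_a), \qquad t_h^* = \log(h^* / h_a)$$

Here x is the predicted coordinate, x_a is the anchor coordinate (a preset, fixed value), and x^* is the ground truth coordinate (from the annotation); y, w, and h follow the same pattern, and the t variables are offsets. Rearranging the first two equations gives the correct formula:

$$x = (t_x * w_a) + x_a, \qquad y = (t_y * h_a) + y_a$$

Read this way, a prediction of t_x = 1 shifts the box to the right by the width of the anchor box, and a prediction of t_x = -1 shifts it to the left by the same amount.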

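This formulation is unconstrained, so any anchor box can end up at any point in the image regardless of which location predicted it. (My reading: because t_x has no bound, an anchor may end up matched to a target box far away from it, which is inefficient. Each anchor should only be responsible for target boxes within about one unit of itself.) With random initialization, the model takes a long time to stabilize and predict sensible offsets.

For this reason the author does not predict offsets and instead predicts location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1, and a logistic activation constrains the network's predictions to the same range.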

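The network now predicts 5 bounding boxes (the priors obtained by clustering) at each cell of the 13 × 13 feature map, and 5 values for each bounding box: t_x, t_y, t_w, t_h, and t_o. The first four are coordinates and t_o is the confidence. If the cell is offset from the top-left corner of the image by (c_x, c_y) and the bounding box prior has width and height (p_w, p_h), the predictions correspond to:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
$$Pr(\text{object}) * IOU(b, \text{object}) = \sigma(t_o)$$

These formulas are easier to follow alongside the Faster R-CNN and YOLOv1 formulas above and the figure below. Because t_x and t_y pass through the sigmoid function, their values are limited to the range 0 to 1. In practice this makes each anchor responsible only for boxes around its own cell, which helps efficiency and convergence. σ is the logistic (sigmoid) activation that the paper mentions when it constrains the predictions to this range. The exponential in b_w and b_h inverts the logarithm used in the Faster R-CNN width and height offsets. σ(t_x) and σ(t_y) are therefore the horizontal and vertical coordinates of the box center relative to the top-left corner of the grid cell, and σ(t_o) is the confidence score of the bounding box.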
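The decoding step can be made concrete with a short sketch. It assumes the parameterization b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h}, with σ(t_o) as the confidence, and works in grid-cell units. The function `decode_box` and its argument layout are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real value into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs (t_x, t_y, t_w, t_h, t_o) to a box in grid-cell units."""
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell    # offset of the cell from the top-left corner of the image
    p_w, p_h = prior   # width and height of the bounding box prior
    b_x = sigmoid(t_x) + c_x   # the center stays inside the predicting cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # the prior is scaled, never made negative
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)
    return b_x, b_y, b_w, b_h, confidence


# All-zero outputs place the box at the center of cell (6, 6) with the prior's size.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 2.0)))
```

However large or small t_x and t_y become, the decoded center cannot leave the cell that predicted it, which is the stability argument made in the text.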

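Once the location prediction is constrained, the parameterization is easier to learn and the model is more stable. Using dimension clusters together with direct location prediction, the two improvements to anchor boxes, raises mAP by almost 5% over the version with anchor boxes.

Figure 3: Bounding boxes with dimension priors and location prediction. The width and height of the box are predicted as offsets from cluster centroids. The center coordinates of the box are predicted relative to the location of filter application using a sigmoid function.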

Fine-Grained Features. This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.
The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine grained features. This gives a modest 1% performance increase.

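**Fine-grained features.** The modified YOLO predicts detections on a 13 × 13 feature map. That is sufficient for large objects, but finer-grained features may help localize smaller objects. Faster R-CNN and SSD both run their proposal networks on several feature maps in the network to obtain a range of resolutions. YOLOv2 takes a different approach and simply adds a passthrough layer that brings in features from an earlier layer at 26 × 26 resolution.

The passthrough layer concatenates the higher-resolution features with the low-resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features. The detector runs on top of this expanded feature map and so has access to fine-grained features. This gives a modest 1% performance increase.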
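The reshaping performed by the passthrough layer can be illustrated with plain nested lists. This is a sketch only: it assumes a block size of 2 and one particular channel ordering, which may differ from the ordering used by the Darknet implementation, and the names `passthrough` and `shape` are hypothetical.

```python
def passthrough(feature_map, block=2):
    """Stack each block x block spatial neighbourhood into the channel axis.

    feature_map is a nested list indexed as [row][col][channel].
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for r in range(0, height, block):
        out_row = []
        for c in range(0, width, block):
            stacked = []
            for dr in range(block):
                for dc in range(block):
                    stacked.extend(feature_map[r + dr][c + dc])
            out_row.append(stacked)
        out.append(out_row)
    return out


def shape(feature_map):
    """(rows, cols, channels) of a nested-list feature map."""
    return len(feature_map), len(feature_map[0]), len(feature_map[0][0])


# A 26 x 26 x 512 map becomes 13 x 13 x 2048.
fine = [[[0.0] * 512 for _ in range(26)] for _ in range(26)]
print(shape(passthrough(fine)))
```

No values are discarded: each output cell holds the four neighbouring input cells side by side, so the detector sees the 26 × 26 detail at the 13 × 13 resolution.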


Multi-Scale Training. The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416×416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model.
Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, …, 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions. The network runs faster at smaller sizes so YOLOv2 offers an easy tradeoff between speed and accuracy.
At low resolutions YOLOv2 operates as a cheap, fairly accurate detector. At 288 × 288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high framerate video, or multiple video streams.
At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. See Table 3 for a comparison of YOLOv2 with other frameworks on VOC 2007. Figure 4

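Multi-scale training. The original YOLO uses an input resolution of 448 × 448; with the addition of anchor boxes the resolution became 416 × 416. Because the model uses only convolutional and pooling layers, it can be resized on the fly. YOLOv2 should be robust to images of different sizes, so this is trained into the model.

Instead of fixing the input image size, the network is changed every few iterations: every 10 batches it randomly chooses a new image dimension. Since the model downsamples by a factor of 32, the sizes are drawn from the multiples of 32 {320, 352, …, 608}, so the smallest option is 320 × 320 and the largest is 608 × 608. The network is resized to that dimension and training continues.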
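The resizing schedule can be sketched as below. The candidate sizes and the 10-batch interval come from the text; uniform sampling, the seed, and the function name `input_size_schedule` are assumptions made for the example.

```python
import random

# Multiples of 32 from 320 to 608, matching a total downsampling factor of 32.
SIZES = list(range(320, 608 + 1, 32))


def input_size_schedule(num_batches, every=10, seed=0):
    """Return the square input size used for each batch of training."""
    rng = random.Random(seed)
    size = None
    schedule = []
    for batch in range(num_batches):
        if batch % every == 0:  # pick a new resolution every `every` batches
            size = rng.choice(SIZES)
        schedule.append(size)
    return schedule


print(SIZES)
print(input_size_schedule(30))
```

In a real training loop the network and the batch of images would be resized whenever the value changes; the sketch only produces the sequence of sizes.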

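This regime forces the network to learn to predict well across a variety of input dimensions, so the same network can predict detections at different resolutions. The network runs faster at smaller sizes, which gives YOLOv2 an easy trade-off between speed and accuracy.

At low resolutions YOLOv2 works as a cheap, fairly accurate detector. At 288 × 288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN, which makes it a good fit for smaller GPUs, high-framerate video, or multiple video streams.

At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still running above real-time speeds. See Table 3 for a comparison of YOLOv2 with other frameworks on VOC 2007.

Figure 4: Accuracy and speed on VOC 2007.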
Further Experiments. We train YOLOv2 for detection on VOC 2012. Table 4 shows the comparative performance of YOLOv2 versus other state-of-the-art detection systems. YOLOv2 achieves 73.4 mAP while running far faster than competing methods. We also train on COCO and compare to other methods in Table 5. On the VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP, comparable to SSD and Faster R-CNN.

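Further experiments. YOLOv2 is trained for detection on VOC 2012. Table 4 compares its performance with other state-of-the-art detection systems: YOLOv2 reaches 73.4 mAP while running far faster than competing methods. It is also trained on COCO and compared with other methods in Table 5. On the VOC metric (IOU = 0.5) YOLOv2 gets 44.0 mAP, comparable to SSD and Faster R-CNN.

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy trade-off between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is on a Geforce GTX Titan X (original, not Pascal model).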

3. Faster

We want detection to be accurate but we also want it to be fast. Most applications for detection, like robotics or self-driving cars, rely on low latency predictions. In order to maximize performance we design YOLOv2 to be fast from the ground up.
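Paraphrase: Detection should be accurate, but it should also be fast. Most applications of detection, such as robotics or self-driving cars, rely on low-latency predictions. To maximize performance, YOLOv2 is designed to be fast from the ground up.

Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. The two exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the anchor box approach increased recall without changing mAP, while using the new network cut computation by 33%.

Table 4: PASCAL VOC2012 test detection results. YOLOv2 performs on par with state-of-the-art detectors such as Faster R-CNN with ResNet and SSD512, and is 2 to 10 times faster.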

Most detection frameworks rely on VGG-16 as the base feature extractor [17]. VGG-16 is a powerful, accurate classification network but it is needlessly complex. The convolutional layers of VGG-16 require 30.69 billion floating point operations for a single pass over a single image at 224 × 224 resolution.
The YOLO framework uses a custom network based on the Googlenet architecture [19]. This network is faster than VGG-16, only using 8.52 billion operations for a forward pass. However, its accuracy is slightly worse than VGG-16. For single-crop, top-5 accuracy at 224 × 224, YOLO’s custom model gets 88.0% ImageNet compared to 90.0% for VGG-16.
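Paraphrase: Most detection frameworks rely on VGG-16 as the base feature extractor. VGG-16 is a powerful, accurate classification network, but it is needlessly complex: its convolutional layers require 30.69 billion floating point operations for a single pass over a single image at 224 × 224 resolution.

The YOLO framework uses a custom network based on the Googlenet architecture. It is faster than VGG-16, needing only 8.52 billion operations for a forward pass, but its accuracy is slightly lower. For single-crop, top-5 accuracy at 224 × 224 on ImageNet, YOLO's custom model reaches 88.0% compared with 90.0% for VGG-16.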

Darknet-19. We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions. We use batch normalization to stabilize training, speed up convergence, and regularize the model.
Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. For a full description see Table 6. Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.

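Darknet-19. A new classification model is proposed as the base of YOLOv2. It builds on prior work on network design and on common knowledge in the field. Like the VGG models, it uses mostly 3 × 3 filters and doubles the number of channels after every pooling step. Following Network in Network (NIN), it uses global average pooling to make predictions and 1 × 1 filters to compress the feature representation between 3 × 3 convolutions. Batch normalization stabilizes training, speeds up convergence, and regularizes the model.

The final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. It requires only 5.58 billion operations to process an image, yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.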

Training for classification. We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 10^{-3}. At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.

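Training for classification. The network is trained with the Darknet neural network framework on the standard ImageNet 1000-class classification dataset for 160 epochs, using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005, and momentum of 0.9. Training uses standard data augmentation tricks, including random crops, rotations, and hue, saturation, and exposure shifts.

As discussed above, after the initial training on 224 × 224 images the network is fine-tuned at a larger size, 448. Fine-tuning uses the same parameters but runs for only 10 epochs and starts at a learning rate of 10^{-3}. At this higher resolution the network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.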
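The "polynomial rate decay with a power of 4" can be illustrated with a small function. The text gives only the starting rate and the power, so the exact form used here, base_lr × (1 − step / total_steps)^power, is an assumption (it is the common definition of a polynomial schedule), as is the choice to measure progress in epochs rather than batches. The name `poly_learning_rate` is hypothetical.

```python
def poly_learning_rate(step, total_steps, base_lr=0.1, power=4.0):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie between 0 and total_steps")
    return base_lr * (1.0 - step / total_steps) ** power


# Learning rate at a few points of a 160-epoch run.
for epoch in (0, 40, 80, 120, 160):
    print(epoch, poly_learning_rate(epoch, 160))
```

With a power of 4 the rate falls quickly at first: halfway through training it is already 1/16 of its starting value.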

Training for detection. We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features. We train the network for 160 epochs with a starting learning rate of 10^{-3}, dividing it by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and momentum of 0.9.
We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.

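Training for detection. The network is modified for detection by removing the last convolutional layer and adding three 3 × 3 convolutional layers with 1024 filters each, followed by a final 1 × 1 convolutional layer with the number of outputs needed for detection. For VOC, 5 boxes are predicted with 5 coordinates each and 20 classes per box, so 125 filters. A passthrough layer is also added from the final 3 × 3 × 512 layer to the second-to-last convolutional layer so that the model can use fine-grained features.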
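The filter count of the final 1 × 1 layer follows directly from the numbers in the text. A minimal sketch, with the hypothetical helper `detection_filters`, makes the arithmetic explicit; the 80-class COCO case is included only as a second illustration.

```python
def detection_filters(num_priors: int, num_classes: int, coords_per_box: int = 5) -> int:
    """Outputs per grid cell: each prior predicts 5 coordinates plus class scores."""
    return num_priors * (coords_per_box + num_classes)


print(detection_filters(num_priors=5, num_classes=20))  # VOC: 5 * (5 + 20)
print(detection_filters(num_priors=5, num_classes=80))  # COCO has 80 classes
```

The same formula shows why reducing the number of priors shrinks the output when the class count is very large.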

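The network is trained for 160 epochs with a starting learning rate of 10^{-3}, divided by 10 at 60 and 90 epochs, with a weight decay of 0.0005 and momentum of 0.9. Data augmentation is similar to that of YOLO and SSD, with random crops, color shifting, and so on. The same training strategy is used on COCO and VOC.

Table 5: Results on the COCO test-dev set. Table adapted from [11].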

4. Stronger

We propose a mechanism for jointly training on classification and detection data. Our method uses images labelled for detection to learn detection-specific information like bounding box coordinate prediction and objectness as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect.
During training we mix images from both detection and classification datasets. When our network sees an image labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image we only backpropagate loss from the classification-specific parts of the architecture.

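Paraphrase: The paper proposes a mechanism for jointly training on classification and detection data. Images labelled for detection are used to learn detection-specific information such as bounding box coordinate prediction and objectness, as well as how to classify common objects. Images with only class labels are used to expand the number of categories the model can detect.

During training, images from detection and classification datasets are mixed. When the network sees an image labelled for detection, it backpropagates based on the full YOLOv2 loss function. When it sees a classification image, it backpropagates only the loss from the classification-specific parts of the architecture.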
This approach presents a few challenges. Detection datasets have only common objects and general labels, like “dog” or “boat”. Classification datasets have a much wider and deeper range of labels. ImageNet has more than a hundred breeds of dog, including “Norfolk terrier”, “Yorkshire terrier”, and “Bedlington terrier”. If we want to train on both datasets we need a coherent way to merge these labels.
Most approaches to classification use a softmax layer across all the possible categories to compute the final probability distribution. Using a softmax assumes the classes are mutually exclusive. This presents problems for combining datasets, for example you would not want to combine ImageNet and COCO using this model because the classes “Norfolk terrier” and “dog” are not mutually exclusive.
We could instead use a multi-label model to combine the datasets which does not assume mutual exclusion. This approach ignores all the structure we do know about the data, for example that all of the COCO classes are mutually exclusive.

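Paraphrase: This approach presents a few challenges. Detection datasets contain only common objects and general labels such as "dog" or "boat". Classification datasets have a much wider and deeper range of labels; ImageNet has more than a hundred breeds of dog, including "Norfolk terrier", "Yorkshire terrier", and "Bedlington terrier". Training on both datasets requires a coherent way to merge these labels.

Most classification approaches apply a softmax layer across all possible categories to compute the final probability distribution. A softmax assumes the classes are mutually exclusive, which is a problem when combining datasets: ImageNet and COCO should not be combined with this model, because the classes "Norfolk terrier" and "dog" are not mutually exclusive.

A multi-label model, which does not assume mutual exclusion, could be used to combine the datasets instead. That approach, however, ignores all the structure that is known about the data, for example that all of the COCO classes are mutually exclusive.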

Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc. Most approaches to classification assume a flat structure to the labels however for combining datasets, structure is exactly what we need.
WordNet is structured as a directed graph, not a tree, because language is complex. For example a “dog” is both a type of “canine” and a type of “domestic animal” which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet.
To build this tree we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case “physical object”. Many synsets only have one path through the graph so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible. So if a concept has two paths to the root and one path would add three edges to our tree and the other would only add one edge, we choose the shorter path.

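Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate [12]. In WordNet, "Norfolk terrier" and "Yorkshire terrier" are both hyponyms of "terrier", which is a type of "hunting dog", which is a type of "dog", which is a "canine", and so on. Most classification approaches assume a flat label structure, but for combining datasets structure is exactly what is needed.

WordNet is structured as a directed graph, not a tree, because language is complex. For example, a "dog" is both a type of "canine" and a type of "domestic animal", which are both synsets in WordNet. Instead of using the full graph structure, the problem is simplified by building a hierarchical tree from the concepts in ImageNet.

To build this tree, the visual nouns in ImageNet are examined along with their paths through the WordNet graph to the root node, in this case "physical object". Many synsets have only one path through the graph, so all of those paths are added to the tree first. The remaining concepts are then examined iteratively, adding the paths that grow the tree by as little as possible. If a concept has two paths to the root, one adding three edges to the tree and the other only one, the shorter path is chosen.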

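The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree, the model predicts conditional probabilities at every node for the probability of each hyponym of that synset given that synset. For example, at the "terrier" node it predicts:

$$Pr(\text{Norfolk terrier} \mid \text{terrier})$$
$$Pr(\text{Yorkshire terrier} \mid \text{terrier})$$
$$Pr(\text{Bedlington terrier} \mid \text{terrier})$$
$$\ldots$$

To compute the absolute probability of a particular node, follow the path through the tree to the root node and multiply the conditional probabilities. To find out whether a picture shows a Norfolk terrier, compute:

$$Pr(\text{Norfolk terrier}) = Pr(\text{Norfolk terrier} \mid \text{terrier}) * Pr(\text{terrier} \mid \text{hunting dog}) * \ldots * Pr(\text{mammal} \mid \text{animal}) * Pr(\text{animal} \mid \text{physical object})$$

For classification purposes, the image is assumed to contain an object:

$$Pr(\text{physical object}) = 1$$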
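To validate this approach, the Darknet-19 model is trained on a WordTree built from the 1000-class ImageNet. Building WordTree1k adds all of the intermediate nodes, which expands the label space from 1000 to 1369. During training, ground truth labels are propagated up the tree, so an image labelled "Norfolk terrier" is also labelled "dog", "mammal", and so on. To compute the conditional probabilities, the model predicts a vector of 1369 values, and a softmax is computed over all synsets that are hyponyms of the same concept (see Figure 5).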
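The per-sibling softmax and the product of conditional probabilities can be sketched on a toy tree. The tree below is hypothetical and far smaller than WordTree1k, and the names `PARENT`, `sibling_groups`, `conditional_probabilities`, and `absolute_probability` are illustrative only.

```python
import math

# Hypothetical toy tree: child -> parent. The root has parent None.
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "vehicle": "physical object",
    "dog": "animal",
    "cat": "animal",
    "terrier": "dog",
    "hound": "dog",
}


def sibling_groups(parent):
    """Group the nodes that share a parent (co-hyponyms)."""
    groups = {}
    for node, par in parent.items():
        if par is not None:
            groups.setdefault(par, []).append(node)
    return groups


def conditional_probabilities(scores, parent):
    """Softmax over each set of siblings, giving Pr(node | parent)."""
    cond = {}
    for children in sibling_groups(parent).values():
        peak = max(scores[c] for c in children)  # subtract the max for stability
        exps = {c: math.exp(scores[c] - peak) for c in children}
        total = sum(exps.values())
        for c in children:
            cond[c] = exps[c] / total
    return cond


def absolute_probability(node, cond, parent):
    """Multiply conditionals along the path to the root, with Pr(root) = 1."""
    prob = 1.0
    while parent[node] is not None:
        prob *= cond[node]
        node = parent[node]
    return prob


raw_scores = {"animal": 2.0, "vehicle": 0.0, "dog": 1.5, "cat": 0.5,
              "terrier": 0.2, "hound": 0.1}
cond = conditional_probabilities(raw_scores, PARENT)
print(absolute_probability("terrier", cond, PARENT))
```

Each group of siblings sums to 1 on its own, so the network output is one vector that holds several independent softmax distributions, as in Figure 5.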

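With the same training parameters as before, the hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy. Despite the 369 additional concepts and the tree structure the network has to predict, accuracy drops only marginally. Classifying this way also has benefits: performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain what type of dog it is, it still predicts "dog" with high confidence, with lower confidences spread among the hyponyms.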

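The formulation also works for detection. Instead of assuming that every image contains an object, YOLOv2's objectness predictor supplies the value of Pr(physical object). The detector predicts a bounding box and the tree of probabilities. The tree is traversed downward, taking the highest-confidence path at every split until some threshold is reached, and that object class is predicted.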
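The traversal can be sketched as a greedy walk down the tree. The text does not define the threshold precisely, so this example assumes one plausible reading: keep descending while the running absolute probability (objectness times the conditionals along the path) stays at or above the threshold. The toy tree, the probabilities, and the name `predict_class` are hypothetical.

```python
def predict_class(cond, children, root, objectness=1.0, threshold=0.5):
    """Descend from the root, following the most likely child at each split."""
    node, prob = root, objectness
    while children.get(node):
        best = max(children[node], key=lambda child: cond[child])
        if prob * cond[best] < threshold:
            break  # going deeper would be too uncertain; stop at this node
        node, prob = best, prob * cond[best]
    return node, prob


# Hypothetical tree and conditional probabilities Pr(child | parent).
children = {
    "physical object": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["terrier", "hound"],
}
cond = {"animal": 0.9, "vehicle": 0.1, "dog": 0.8, "cat": 0.2,
        "terrier": 0.55, "hound": 0.45}

print(predict_class(cond, children, "physical object", threshold=0.5))
print(predict_class(cond, children, "physical object", threshold=0.3))
```

With the higher threshold the walk stops at "dog", because the network is unsure about the breed; with the lower threshold it continues to "terrier". This mirrors the graceful degradation described for classification.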

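Dataset combination with WordTree. WordTree can be used to combine multiple datasets in a sensible way: the categories in each dataset are simply mapped to synsets in the tree. Figure 6 shows an example that uses WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse, so this technique can be used with most datasets.

Figure 5: Prediction on ImageNet vs WordTree. Most ImageNet models use one large softmax to predict a probability distribution. With WordTree, multiple softmax operations are performed over co-hyponyms.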

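Joint classification and detection. With datasets combined through WordTree, a joint model can be trained on classification and detection. To train an extremely large-scale detector, the combined dataset is built from the COCO detection dataset and the top 9000 classes of the full ImageNet release. To allow evaluation, any classes from the ImageNet detection challenge that were not already included are added as well. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset, so the dataset is balanced by oversampling COCO so that ImageNet is larger by a factor of only 4:1.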

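YOLO9000 is trained on this dataset. It uses the base YOLOv2 architecture but with only 3 priors instead of 5 to limit the output size. When the network sees a detection image, loss is backpropagated as normal. For classification loss, loss is backpropagated only at or above the corresponding level of the label. For example, if the label is "dog", no error is assigned to predictions further down the tree, such as "German Shepherd" versus "Golden Retriever", because that information is not available.

Figure 6: Combining datasets using the WordTree hierarchy. Using the WordNet concept graph, a hierarchical tree of visual concepts is built. Datasets can then be merged by mapping the classes in each dataset to synsets in the tree. This is a simplified view of WordTree for illustration purposes.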

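When the network sees a classification image, only classification loss is backpropagated. To do this, the bounding box that predicts the highest probability for that class is found, and the loss is computed on just its predicted tree. The predicted box is also assumed to overlap the ground truth label by at least 0.3 IOU, and objectness loss is backpropagated based on this assumption.

With this joint training, YOLO9000 learns to find objects in images using the detection data in COCO and learns to classify a wide variety of these objects using data from ImageNet.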

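YOLO9000 is evaluated on the ImageNet detection task. The detection task for ImageNet shares only 44 object categories with COCO, which means that for the majority of the test images YOLO9000 has seen only classification data, not detection data. YOLO9000 gets 19.7 mAP overall, and 16.0 mAP on the 156 disjoint object classes for which it has never seen any labelled detection data. This mAP is higher than the results achieved by DPM, even though YOLO9000 is trained on different datasets with only partial supervision [4]. It also detects 9000 other object categories simultaneously, all in real time.

Analysis of YOLO9000's performance on ImageNet shows that it learns new species of animals well but struggles with learning categories such as clothing and equipment. New animals are easier to learn because the objectness predictions generalize well from the animals in COCO. Conversely, COCO has no bounding box labels for any type of clothing, only for person, so YOLO9000 struggles to model categories like "sunglasses" or "swimming trunks".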

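Table 7: YOLO9000 best and worst classes on ImageNet. The classes with the highest and lowest AP from the 156 weakly supervised classes. YOLO9000 learns good models for a variety of animals but struggles with new classes like clothing or equipment.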