YOLO v3 (Translation and Notes)

YOLOv3: An Incremental Improvement

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.


1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little. Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!


The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.


2. The Deal

So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.


2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$


During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$, our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.
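To make the decoding and its inverse concrete, here is a minimal NumPy sketch; the function names and the example numbers are mine, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, prior_wh, cell_xy):
    """Decode raw predictions (tx, ty, tw, th) into a box (bx, by, bw, bh),
    all in grid-cell units."""
    tx, ty, tw, th = t
    cx, cy = cell_xy          # offset of the cell from the image's top-left corner
    pw, ph = prior_wh         # width/height of the bounding-box prior
    return np.array([sigmoid(tx) + cx,
                     sigmoid(ty) + cy,
                     pw * np.exp(tw),
                     ph * np.exp(th)])

def encode_box(b, prior_wh, cell_xy):
    """Invert the equations above to get the regression targets t-hat."""
    bx, by, bw, bh = b
    cx, cy = cell_xy
    pw, ph = prior_wh
    logit = lambda p: np.log(p / (1.0 - p))
    return np.array([logit(bx - cx), logit(by - cy),
                     np.log(bw / pw), np.log(bh / ph)])

# round-trip check with made-up numbers
t = np.array([0.2, -0.3, 0.5, -0.1])
box = decode_box(t, prior_wh=(3.0, 4.0), cell_xy=(6, 5))
print(encode_box(box, prior_wh=(3.0, 4.0), cell_xy=(6, 5)).round(6))  # recovers t
```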


YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
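A rough sketch of that assignment rule, assuming boxes in (center x, center y, w, h) format; the helper names and the toy numbers are my own, not from the paper.

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign_objectness(priors, gt_box, ignore_thresh=0.5):
    """Per-prior objectness targets: 1 = positive, 0 = negative, -1 = ignored."""
    ious = np.array([iou(p, gt_box) for p in priors])
    targets = np.zeros(len(priors))
    targets[ious > ignore_thresh] = -1   # overlaps a lot but is not the best: no loss
    targets[np.argmax(ious)] = 1         # the single best prior is the positive
    return targets

priors = [(5.0, 5.0, 10, 13), (5.0, 5.0, 33, 23), (5.0, 5.0, 62, 45)]
print(assign_objectness(priors, gt_box=(5.0, 5.0, 30, 25)))
```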


Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].


2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.


This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
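As a small illustration of the difference (the class names and scores below are made up), independent sigmoids let one box score high for both Woman and Person, while a softmax forces the classes to compete:

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])   # raw class scores for one box: Woman, Person, Zebra

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.round(2))   # [0.62 0.38 0.  ] -- classes compete, probabilities sum to 1
print(sigmoid.round(2))   # [0.88 0.82 0.05] -- each class is judged independently

# binary cross-entropy against a multi-label target (the box is both Woman and Person)
target = np.array([1.0, 1.0, 0.0])
bce = -(target * np.log(sigmoid) + (1 - target) * np.log(1 - sigmoid)).mean()
print(bce)
```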


2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N ×N ×[3 ×(4+1+80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
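As a quick shape check, assuming a 416×416 input and the strides of 32, 16, and 8 commonly used with YOLOv3 (the strides are not stated in this paragraph, so treat them as an assumption):

```python
# Shape of the prediction tensor at each of the 3 scales on COCO.
num_anchors, num_classes = 3, 80
channels = num_anchors * (4 + 1 + num_classes)   # 3 * (4 box + 1 objectness + 80 classes) = 255

for stride in (32, 16, 8):
    n = 416 // stride                            # N for this scale: 13, 26, 52
    print(f"stride {stride:2d}: {n} x {n} x {channels}")
```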


Next we take the feature map from 2 layers previous and upsample it by 2. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
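A schematic PyTorch sketch of this upsample-and-concatenate step; the channel counts are illustrative, not the exact released architecture:

```python
import torch
import torch.nn as nn

class UpsampleMerge(nn.Module):
    """Schematic route: 1x1 conv, 2x upsample, then concatenate with an earlier,
    higher-resolution feature map and process the result."""
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(coarse_ch, out_ch, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.merge = nn.Conv2d(out_ch + fine_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, coarse, fine):
        x = self.up(self.reduce(coarse))   # e.g. 13x13 -> 26x26
        x = torch.cat([x, fine], dim=1)    # concatenate along the channel dimension
        return self.merge(x)

coarse = torch.randn(1, 1024, 13, 13)      # deep, low-resolution features
fine = torch.randn(1, 512, 26, 26)         # earlier, higher-resolution features
print(UpsampleMerge(1024, 512, 256)(coarse, fine).shape)  # torch.Size([1, 256, 26, 26])
```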


We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.


We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13); (16×30); (33×23); (30×61); (62×45); (59×119); (116 × 90); (156 ×198); (373×326).
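The clustering procedure itself is not spelled out here; the sketch below follows the YOLOv2 approach of running k-means with 1 − IOU as the distance between a box and a centroid, and the helper names are mine:

```python
import numpy as np

def wh_iou(wh, centroids):
    """IOU between one box and each centroid, comparing widths/heights only
    (both boxes assumed to share the same top-left corner)."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_priors(boxes_wh, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest (1 - IOU) distance
        assign = np.array([np.argmax(wh_iou(b, centroids)) for b in boxes_wh])
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes_wh[assign == j].mean(axis=0)
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]

# toy usage with random box sizes; on COCO this kind of procedure yields priors
# like (10x13) ... (373x326)
boxes = np.abs(np.random.default_rng(1).normal(100, 60, size=(500, 2))) + 1
print(kmeans_priors(boxes).round(1))
```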


2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3×3 and 1×1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it… wait for it… Darknet-53!
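Schematically, the basic building block is a 1×1 convolution that halves the channels, a 3×3 convolution that restores them, and a shortcut that adds the input back. A PyTorch sketch (the BatchNorm and LeakyReLU settings are my guesses at typical Darknet defaults, not taken from this report):

```python
import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class DarknetResidual(nn.Module):
    """Darknet-53 style block: 1x1 conv halves the channels, 3x3 conv restores
    them, and the input is added back through a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, 1),
            conv_bn_leaky(channels // 2, channels, 3),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 256, 52, 52)
print(DarknetResidual(256)(x).shape)   # torch.Size([1, 256, 52, 52])
```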


[Darknet-53 architecture table]
This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:


[Table: ImageNet comparison of backbones (accuracy, floating point operations, speed)]
Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.


Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.


2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].
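“Multi-scale training” here presumably follows the YOLOv2 scheme of resizing the network input to a random multiple of 32 every few batches; a sketch that assumes the YOLOv2 values (a 320–608 range, new size every 10 batches):

```python
import random

SIZES = list(range(320, 609, 32))   # 320, 352, ..., 608: all multiples of the 32x stride

def input_size_for_batch(batch_idx, every=10, rng=random.Random(0)):
    """Pick a new random input resolution every `every` batches, keep it in between."""
    rng.seed(batch_idx // every)    # deterministic per 10-batch chunk
    return rng.choice(SIZES)

for i in (0, 5, 10, 20, 30):
    print(i, input_size_for_batch(i))
```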


3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCO’s weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.


However, when we look at the “old” detection metric of mAP at IOU = .5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects.


However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.


In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high AP_S (small-object) performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.


When we plot accuracy vs speed on the AP50 metric (see figure 5) we see YOLOv3 has significant benefits over other detection systems. Namely, it’s faster and better.


4. Things We Tried That Didn’t Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.


Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.


Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x,y offset instead of the logistic activation. This led to a couple point drop in mAP.


Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.
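For reference, focal loss down-weights easy examples relative to plain cross-entropy. A NumPy sketch of the binary form, using the γ = 2 and α = 0.25 defaults from the RetinaNet paper (not values from this one):

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma so that
    well-classified (easy) examples contribute very little."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

p = np.array([0.9, 0.6, 0.1])    # predicted probabilities for three positive examples
y = np.array([1, 1, 1])
print(binary_focal_loss(p, y).round(4))   # the easy example (0.9) is heavily down-weighted
print((-np.log(p)).round(4))              # plain cross-entropy, for comparison
```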


Dual IOU thresholds and truth assignment. Faster R-CNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example, by [.3–.7] it is ignored, and if it overlaps all ground truth objects by less than .3 it is a negative example. We tried a similar strategy but couldn’t get good results.
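A small sketch of that dual-threshold rule, taking each anchor’s best IOU with any ground-truth box as input (the .7/.3 thresholds are from the text; everything else is my own naming):

```python
import numpy as np

def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Faster R-CNN style assignment from each anchor's best IOU with any
    ground-truth box: positive above .7, negative below .3, ignored in between."""
    labels = np.full(len(ious), -1)      # -1 = ignored, contributes no loss
    labels[ious >= pos_thresh] = 1       # positive example
    labels[ious < neg_thresh] = 0        # negative example
    return labels

best_ious = np.array([0.85, 0.55, 0.10, 0.31])
print(label_anchors(best_ious))          # [ 1 -1  0 -1]
```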


We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.


5. What This All Means

YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.


Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: “A full discussion of evaluation metrics will be added once the evaluation server is complete”. Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! “Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult.” [18] If humans have a hard time telling the difference, how much does it matter?


But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to… wait, you’re saying that’s exactly what it will be used for?? Oh.


I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.


In closing, do not @me. (Because I finally quit Twitter).

(Translator’s note: the creator of YOLO later chose to walk away from the field, unhappy that the US was putting the technology to military use!)

Notes:

Reading the network architecture:

YOLOv1 kept the traditional CNN design pattern: plain convolution and pooling followed by fully connected layers. Later versions changed the structure dramatically. After several upgrades, YOLOv3 consists only of convolutional layers, with the size of the output feature maps controlled by the convolution stride, so there is no particular restriction on the input image size. It borrows the feature pyramid idea: small feature maps are used to detect large objects, while large feature maps detect small objects. There are 3 anchor boxes per scale; each box carries 4 box-coordinate values, 1 box-confidence value, and one value per object class. Through concatenation and upsampling layers, YOLOv3 outputs 3 feature maps in total: the first is downsampled by 32×, the second by 16×, and the third by 8×. The network as a whole absorbs the key ideas of ResNet, DenseNet, and FPN. In the forward pass it continues to use dimension clusters as priors for predicting bounding boxes: k-means clustering on the ground-truth boxes in the dataset yields 9 priors of different sizes, which are divided evenly across the feature maps at the different scales, with larger feature maps using smaller priors. During training, binary cross-entropy loss is used for class prediction.
Since neither the paper nor follow-up articles give a diagram of the YOLOv3 network, I grabbed a figure from here that is accurate and easy to understand.
[YOLOv3 network architecture diagram]

Training strategy and loss function:

The YOLOv3 loss is the sum of the losses over the three output feature maps:

$$Loss = Loss_{N_1} + Loss_{N_2} + Loss_{N_3}$$

The loss for a single feature map has the form:

$$
\begin{aligned}
loss_{N_1} =\; & \lambda_{box} \sum_{i=0}^{N_1\times N_1} \sum_{j=0}^{3} 1_{ij}^{obj}\left[ (t_x - t_x')^2 + (t_y - t_y')^2 \right] \\
+\; & \lambda_{box} \sum_{i=0}^{N_1\times N_1} \sum_{j=0}^{3} 1_{ij}^{obj}\left[ (t_w - t_w')^2 + (t_h - t_h')^2 \right] \\
-\; & \lambda_{obj} \sum_{i=0}^{N_1\times N_1} \sum_{j=0}^{3} 1_{ij}^{obj} \log(c_{ij}) \\
-\; & \lambda_{noobj} \sum_{i=0}^{N_1\times N_1} \sum_{j=0}^{3} 1_{ij}^{noobj} \log(1 - c_{ij}) \\
-\; & \lambda_{class} \sum_{i=0}^{N_1\times N_1} \sum_{j=0}^{3} 1_{ij}^{obj} \sum_{c\in classes}\left[ p_{ij}'(c)\log\left(p_{ij}(c)\right) + \left(1 - p_{ij}'(c)\right)\log\left(1 - p_{ij}(c)\right) \right]
\end{aligned}
$$

where:

The $\lambda$ terms are weighting constants that control the balance between the box loss, the objectness (obj) confidence loss, and the no-object (noobj) confidence loss. The number of negative examples is usually tens of times the number of positives, so these weighting hyperparameters can be used to tune detection quality.

$1_{ij}^{obj}$ is 1 if anchor $j$ in cell $i$ is a positive example and 0 otherwise; $1_{ij}^{noobj}$ is 1 if it is a negative example and 0 otherwise.

The $(x, y, w, h)$ terms use MSE as the loss function; smooth L1 loss (from Faster R-CNN) can also be used and makes training smoother. Since the confidence and class labels are 0/1 binary targets, cross-entropy (with a sigmoid activation) is used as their loss function.
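To make the formula concrete, here is a NumPy sketch of the loss for a single output grid. Variable names follow the symbols above; the $\lambda$ values and the toy tensors are placeholders, not values from the paper.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy; with y=1 it reduces to -log(p), with y=0 to -log(1-p)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def scale_loss(pred, target, obj_mask, noobj_mask,
               lam_box=1.0, lam_obj=1.0, lam_noobj=0.5, lam_cls=1.0):
    """Loss for one N x N grid with 3 anchors.

    pred / target hold, per cell and anchor: tx, ty, tw, th, confidence c,
    and the per-class probabilities p(c).  obj_mask is 1^{obj}_{ij},
    noobj_mask is 1^{noobj}_{ij}.
    """
    box = lam_box * np.sum(obj_mask[..., None] * (pred[..., :4] - target[..., :4]) ** 2)
    obj = lam_obj * np.sum(obj_mask * bce(pred[..., 4], 1.0))
    noobj = lam_noobj * np.sum(noobj_mask * bce(pred[..., 4], 0.0))
    cls = lam_cls * np.sum(obj_mask[..., None] * bce(pred[..., 5:], target[..., 5:]))
    return box + obj + noobj + cls

# toy example: a 13 x 13 grid, 3 anchors, 80 classes
N, A, C = 13, 3, 80
pred = np.random.rand(N, N, A, 5 + C)
target = np.zeros((N, N, A, 5 + C))
obj_mask = np.zeros((N, N, A))
obj_mask[6, 6, 1] = 1                              # one positive anchor
noobj_mask = 1 - obj_mask
target[6, 6, 1, :4] = [0.5, 0.5, 0.2, -0.1]        # its regression targets
target[6, 6, 1, 5 + 17] = 1                        # its class label (class index 17)
print(scale_loss(pred, target, obj_mask, noobj_mask))
```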
