Object Detection: A Translation of the YOLOv3 Paper (High-Quality Version)


At work, a securities project I was involved in used a scene-recognition model built on YOLOv3, so I spent some time studying this paper. I consulted a lot of material and added my own understanding to keep the quality of the translation high. If you find any mistakes, please leave a comment and point them out.

This article comes from my personal website and is also published on the WeChat public account 阅读充电笔记 (ID: yueduchongdianbiji).

YOLOv3 Paper Translation (High-Quality Version): www.taominze.com
YOLOv3 paper: pjreddie.com
Contents
0. Abstract
1. Introduction
2. The Deal
2.1 Bounding Box Prediction
2.2 Class Prediction
2.3 Predictions Across Scales
2.4 Feature Extractor
2.5 Training
3. How We Do
4. Things We Tried That Didn't Work
5. What This All Means

0. Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

We made some updates to YOLO! A bunch of small design changes make it better. We also trained a great new network. It is a little bigger than the previous version, YOLOv2, but more accurate, and it is still fast, so don't worry. At 320 × 320 input, YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. Under the old .5 IOU mAP detection metric YOLOv3 is also quite good: on a Titan X it achieves 57.9 AP50 in 51 ms, comparable to RetinaNet's 57.5 AP50 in 198 ms, while being 3.8× faster. As always, all the code is available at https://pjreddie.com/yolo/.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.

Actually, that’s what brings us here today. We have a camera-ready deadline and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.

Sometimes you just coast for a year, you know? I didn't do much research this year; I spent a lot of time on Twitter and played around with GANs a little. With a bit of momentum left over from last year, I managed to make some improvements to YOLO. Honestly, nothing super interesting, just a bunch of small changes that make it better. I also helped out a little with other people's research.

Actually, that is what brings us here today: we have a camera-ready deadline and need to cite some of the random updates I made to YOLO, but there is no source for them, so get ready for a tech report!

The great thing about tech reports is that they need no introduction; you all know why we are here. So the end of this introduction serves as a signpost for the rest of the paper: first we explain the deal with YOLOv3, then how it performs, then some things we tried that did not work, and finally what all of this means.

2. The Deal

So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.

So here is the deal with YOLOv3: mostly we borrowed good ideas from other people. We also trained a new classifier network that is better than the others. To make it easy to follow, we walk through the whole system from scratch.


2.1 Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes. The network predicts 4 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w\, e^{t_w}$$
$$b_h = p_h\, e^{t_h}$$

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following Faster R-CNN. We use the threshold of .5. Unlike Faster R-CNN, our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

Following YOLO9000, our system predicts bounding boxes using dimension clusters as anchor boxes. The network predicts 4 coordinates for each bounding box: $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top-left corner of the image by $(c_x, c_y)$ and the corresponding bounding box prior has width and height $p_w$, $p_h$, then the predictions are:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w\, e^{t_w}$$
$$b_h = p_h\, e^{t_h}$$

During training we use a sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$, the gradient is the ground truth value (computed from the ground truth box) minus our prediction, $\hat{t}_* - t_*$. This ground truth value is easily obtained by inverting the equations above.
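To spell out "inverting the equations above" (these target equations are implied by the text rather than printed in the paper): given a ground truth box $(b_x, b_y, b_w, b_h)$ placed on the grid and its assigned prior $(p_w, p_h)$,

$$\hat{t}_x = \sigma^{-1}(b_x - c_x), \qquad \hat{t}_y = \sigma^{-1}(b_y - c_y)$$
$$\hat{t}_w = \ln\!\left(\frac{b_w}{p_w}\right), \qquad \hat{t}_h = \ln\!\left(\frac{b_h}{p_h}\right)$$

where $\sigma^{-1}(z) = \ln\frac{z}{1-z}$ is the inverse of the logistic function.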

YOLOv3 predicts an objectness score for each bounding box using logistic regression. The score should be 1 if the bounding box prior overlaps a ground truth object more than any other prior does. If a prior is not the best but still overlaps a ground truth object by more than a threshold, the prediction is ignored; we use a threshold of 0.5. Our system assigns only one bounding box prior to each ground truth object, and a prior that is not assigned to any ground truth object incurs no loss for the coordinate or class predictions, only for objectness.
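To make the decoding and assignment rules above concrete, here is a minimal Python sketch (not the author's Darknet code; the helper names `decode_box` and `objectness_targets` are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw network outputs (tx, ty, tw, th) for one cell and one prior
    into a box: center (bx, by) in grid-cell units, size (bw, bh) in the
    same units as the prior (pw, ph)."""
    bx = sigmoid(tx) + cx      # sigma(tx) keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # width  = prior width  * e^tw
    bh = ph * math.exp(th)     # height = prior height * e^th
    return bx, by, bw, bh

def objectness_targets(ious, best_prior, ignore_thresh=0.5):
    """Per-prior objectness targets following the rule described above:
    1 for the single best-overlapping prior, None ('ignored', no loss) for
    non-best priors above the threshold, 0 (negative) otherwise."""
    targets = []
    for i, iou in enumerate(ious):
        if i == best_prior:
            targets.append(1)
        elif iou > ignore_thresh:
            targets.append(None)
        else:
            targets.append(0)
    return targets

# Cell (3, 5) with the (62, 45) COCO prior:
print(decode_box(0.2, -0.1, 0.3, 0.0, cx=3, cy=5, pw=62, ph=45))
print(objectness_targets([0.8, 0.6, 0.2], best_prior=0))   # [1, None, 0]
```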


2.2 Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

Each box predicts the classes it may contain using multilabel classification. We do not use a softmax; instead we simply use independent logistic classifiers, because we found a softmax is unnecessary for good performance. During training we use binary cross-entropy loss for the class predictions.

This choice helps when we move YOLOv3 to more complex domains such as the Open Images dataset, which contains many overlapping labels (e.g. woman and person). A softmax imposes the assumption that each box contains exactly one class, which is often not the case; a multilabel approach models the data better.
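A small sketch of the multilabel formulation in plain NumPy (an illustration, not the Darknet implementation): each class gets its own logistic output, and the loss is a sum of per-class binary cross-entropies, so overlapping labels such as "woman" and "person" can both be 1 for the same box.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_class_loss(logits, labels, eps=1e-7):
    """Binary cross-entropy over independent per-class logistic outputs.

    logits: (num_classes,) raw class scores for one box
    labels: (num_classes,) 0/1 targets; several entries may be 1 at once
    """
    p = np.clip(sigmoid(logits), eps, 1 - eps)
    return float(-np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# Example with overlapping labels: class 0 = "person", class 1 = "woman"
logits = np.array([2.0, 1.5, -3.0])
labels = np.array([1.0, 1.0, 0.0])   # both "person" and "woman" are correct
print(multilabel_class_loss(logits, labels))
```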

2.3 Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO we predict 3 boxes at each scale so the tensor is $N \times N \times [3 \cdot (4 + 1 + 80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13),(16×30),(33×23),(30×61),(62×45),(59×119), (116×90), (156×198), (373×326).

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from these scales using a concept similar to feature pyramid networks. We add several convolutional layers to the base feature extractor, and the last of them predicts a 3-d tensor encoding the bounding box, objectness, and class predictions. In our COCO experiments we predict 3 boxes at each scale, so the tensor is $N \times N \times [3 \cdot (4 + 1 + 80)]$: 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers earlier and upsample it by 2×. We also take a feature map from earlier in the network and merge it with the upsampled features by concatenation. This lets us obtain more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map and finally predict a similar tensor, now twice the size.
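The merge step can be illustrated in a few lines of NumPy (a sketch assuming nearest-neighbor upsampling and channels-first (C, H, W) feature maps; the function names are my own):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge_scales(deep_features, earlier_features):
    """Upsample the deeper (coarser) map and concatenate it with an
    earlier, finer-grained map along the channel axis."""
    up = upsample2x(deep_features)
    assert up.shape[1:] == earlier_features.shape[1:], "spatial sizes must match"
    return np.concatenate([up, earlier_features], axis=0)

deep = np.random.randn(256, 13, 13)      # coarse 13x13 map from deeper layers
fine = np.random.randn(128, 26, 26)      # earlier 26x26 map
print(merge_scales(deep, fine).shape)    # -> (384, 26, 26)
```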

We repeat the same design once more to predict boxes for the final scale, so the predictions for the third scale benefit from all the earlier computation as well as fine-grained features from early in the network.

As before, we still use k-means clustering to determine the bounding box priors. We simply chose 9 clusters and 3 scales somewhat arbitrarily and divided the clusters evenly across the scales. On the COCO dataset the 9 clusters are: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
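As a quick sanity check of the shapes and the cluster split (assigning the smallest clusters to the finest grid, and using a 416 × 416 input with 52/26/13 grids, is the common convention rather than something stated in this paragraph):

```python
# The 9 COCO clusters from the text, split 3 per scale. Assigning the
# smallest clusters to the finest (largest) grid is the usual convention;
# the report only says the clusters are divided evenly across scales.
clusters = [(10, 13), (16, 30), (33, 23),       # finest grid (e.g. 52x52 at 416 input)
            (30, 61), (62, 45), (59, 119),      # middle grid (e.g. 26x26)
            (116, 90), (156, 198), (373, 326)]  # coarsest grid (e.g. 13x13)

num_classes = 80
boxes_per_scale = 3
channels = boxes_per_scale * (4 + 1 + num_classes)   # 3 * 85 = 255

for n in (52, 26, 13):   # grid sizes for a 416x416 input (strides 8, 16, 32)
    print(f"{n} x {n} x {channels}")   # N x N x [3 * (4 + 1 + 80)]
```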

2.4 Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!

We use a new network for feature extraction. It is a hybrid of the network used in YOLOv2 (Darknet-19) and the newer residual networks: successive 3 × 3 and 1 × 1 convolutional layers, now with some shortcut connections, and significantly larger overall. It has 53 convolutional layers, so we call it... wait for it... Darknet-53!
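A sketch of the building block this describes, written in PyTorch for brevity (the original implementation is in C in the Darknet framework; the 1 × 1 layer that halves the channels followed by a 3 × 3 layer that restores them, with batch norm and leaky ReLU, follows the usual Darknet-53 convention):

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One Darknet-53-style block: 1x1 conv to halve channels, 3x3 conv to
    restore them, plus a shortcut connection (a sketch, not the C source)."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)   # shortcut connection

x = torch.randn(1, 64, 52, 52)
print(DarknetResidual(64)(x).shape)   # torch.Size([1, 64, 52, 52])
```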

[Table: the Darknet-53 architecture]

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

This new network is much more powerful than Darknet-19 yet still more efficient than ResNet-101 or ResNet-152. The table below shows results on ImageNet:

[Table: ImageNet comparison of backbones]

Each network is trained with identical settings and tested at 256 × 256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That's mostly because ResNets have just way too many layers and aren't very efficient.

Each network is trained with identical settings and tested at 256 × 256, single-crop accuracy, with run times measured on a Titan X at 256 × 256. Darknet-53 is thus on par with state-of-the-art classifiers while using fewer floating point operations and running faster: it is better than ResNet-101 and 1.5× faster, and it matches ResNet-152's performance while being 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second, which means the network structure makes better use of the GPU, making it more efficient to evaluate and therefore faster. This is mostly because ResNets simply have too many layers and are not very efficient.

2.5 Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing.

We still train on full images, with no hard negative mining or anything like that. We use multi-scale training, plenty of data augmentation, batch normalization, and all the other standard tricks. Training and testing are done with the Darknet neural network framework.
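For illustration, multi-scale training can be as simple as re-picking the input resolution every few batches; the 320–608 range in multiples of 32 below follows YOLOv2, since this report does not spell out the schedule:

```python
import random

def pick_input_size(stride=32, low=320, high=608):
    """Pick a new square input resolution that is a multiple of the stride.
    The 320-608 range is the one used for YOLOv2's multi-scale training; the
    YOLOv3 report only says 'multi-scale training', so treat it as an example."""
    return random.choice(list(range(low, high + 1, stride)))

for batch in range(30):
    if batch % 10 == 0:                       # e.g. re-pick every 10 batches
        size = pick_input_size()
        print(f"batch {batch}: training at {size} x {size}")
```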

3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.

YOLOv3 performs very well! See Table 3. On COCO's odd "average mean AP" metric it is on par with the SSD variants while being 3× faster, although it still lags quite a bit behind models such as RetinaNet on this metric.

[Table 3: detection results on COCO]

However, when we look at the "old" detection metric of mAP at IOU = .5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object. In the past YOLO struggled with small objects.

However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high APS performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the AP50 metric (see figure 3) we see YOLOv3 has significant benefits over other detection systems. Namely, it’s faster and better.

However, when we evaluate YOLOv3 on the "old" detection metric of mAP at IOU = 0.5 (AP50 in the table), it is very strong: almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. Its performance drops significantly as the IOU threshold increases, though, which shows that YOLOv3 struggles to get its boxes perfectly aligned with the object.
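Since the whole discussion hinges on IOU thresholds, a minimal IOU helper may be useful for reference (my own example, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A box shifted by a quarter of its width still clears the .5 IOU bar...
print(iou((0, 0, 100, 100), (25, 0, 125, 100)))   # 0.6
# ...which is why "decent but not perfectly aligned" boxes still score well at AP50.
```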

In the past YOLO struggled with small objects, but now we see the opposite trend: with the new multi-scale predictions, YOLOv3 has relatively high performance on small objects (APS), while it does comparatively worse on medium and large objects. More investigation is needed to get to the bottom of this.

When we plot accuracy against speed on the AP50 metric (see Figure 3), YOLOv3 shows significant benefits over other detection systems; in other words, it is both faster and better.

[Figure 3: speed versus accuracy on the AP50 metric]

4. Things We Tried That Didn't Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.

Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn't work very well.

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP.

Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren't totally sure.

Dual IOU thresholds and truth assignment. Faster RCNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example, by [.3 − .7] it is ignored, and if it overlaps all ground truth objects by less than .3 it is a negative example. We tried a similar strategy but couldn't get good results.

We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.

We tried a lot of things while working on YOLOv3, and much of it did not work. Here is what we can remember.

Anchor box x, y offset predictions. We tried the normal anchor box prediction mechanism, where the x, y offsets are predicted as multiples of the box width or height using a linear activation. We found this formulation reduced model stability and did not work well.

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offsets instead of the logistic activation. This caused a drop of a couple of mAP points.

Focal loss. We tried using focal loss, and it dropped our mAP by about 2 points. YOLOv3 may already be robust to the problem focal loss tries to solve, because it has separate objectness predictions and conditional class predictions, so for most examples there is no loss from the class predictions? Or something? We are not entirely sure.
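For reference, the focal loss they experimented with is the one proposed by Lin et al.; a minimal sketch for a single binary prediction, omitting the optional α balancing factor:

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss of Lin et al. for one binary prediction (alpha term omitted).
    p: predicted probability of the positive class, y: 0/1 label."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Well-classified easy examples are strongly down-weighted:
print(focal_loss(0.9, 1))   # ~0.001  (plain cross-entropy would be ~0.105)
print(focal_loss(0.1, 1))   # ~1.865  (a hard example keeps most of its loss)
```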

Dual IOU thresholds and truth assignment. Faster R-CNN uses two IOU thresholds during training: a prediction overlapping a ground truth box by at least 0.7 is a positive example, overlaps in [0.3, 0.7] are ignored, and anything below 0.3 against all ground truth objects is a negative example. We tried a similar strategy but could not get good results.
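A sketch of that Faster R-CNN-style assignment rule (a hypothetical helper, not the authors' code):

```python
def assign_label(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Faster R-CNN-style assignment the authors experimented with.
    max_iou is the prediction's best IOU over all ground truth boxes."""
    if max_iou >= pos_thresh:
        return "positive"
    if max_iou < neg_thresh:
        return "negative"
    return "ignored"            # in [0.3, 0.7): contributes no loss

for v in (0.85, 0.5, 0.1):
    print(v, assign_label(v))
```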

We quite like our current formulation; it seems to be at least a local optimum. Some of these techniques might eventually produce good results; perhaps they just need more tuning to stabilize training.

5. What This All Means

YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.

Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: "A full discussion of evaluation metrics will be added once the evaluation server is complete". Russakovsky et al report that humans have a hard time distinguishing an IOU of .3 from .5! "Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult." If humans have a hard time telling the difference, how much does it matter?

But maybe a better question is: "What are we going to do with these detectors now that we have them?" A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won't be used to harvest your personal information and sell it to.... wait, you're saying that's exactly what it will be used for?? Oh.

Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait.....

I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park, or tracking their cat as it wanders around their house. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.

In closing, do not @ me. (Because I finally quit Twitter).

YOLOv3 is a great detector: fast and accurate. It is not as strong on the COCO metric that averages AP over IOU thresholds from 0.5 to 0.95, but it is very good on the old detection metric of 0.5 IOU.

So why did we switch metrics anyway? The original COCO paper contains only this cryptic sentence: "A full discussion of evaluation metrics will be added once the evaluation server is complete." Russakovsky et al. report that humans have a hard time distinguishing an IOU of 0.3 from 0.5: "Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult." If humans can barely tell the difference, how much does it matter?

Perhaps a better question is: "What are we going to do with these detectors now that we have them?" Many of the people doing this research work at Google and Facebook, so I guess at least we know the technology is in good hands and will definitely never be used to harvest your personal information and sell it to... wait, you're telling me that is exactly what it will be used for? Oh.

The other people pouring money into vision research are the military, and they have never done anything horrible like killing lots of people with new technology... oh wait.

I very much hope that most people using computer vision are just doing happy, good things with it, like counting the zebras in a national park or tracking their cat as it wanders around the house. But computer vision is already being put to questionable use, and as researchers we have a responsibility at least to consider the harm our work might do and to think about ways to mitigate it. We owe the world that much.

In closing: do not @ me. (I have finally quit Twitter.)


Author: Kevin Tao (www.taominze.com)

Source: WeChat public account 阅读充电笔记 (ID: yueduchongdianbiji)

Note: please credit the source when reposting.
