YOLOv4: The Subtleties of High-Speed Object Detection

One of the more impressive things you can do is look at things and understand exactly what you’re seeing. Your brain is constantly receiving two feeds of photon interpretations and somehow you’re able to say “this is a pineapple and this is pizza” — and you then know never to combine the two.

Object recognition is impressive because it’s 1) abstract (babies have no idea what’s going on) and 2) damn hard to learn. As AI improves, we’re constantly grappling with this challenge, which consists of detection (something is here) and classification (what is this).

That being said, it’s a much easier task to identify open parking spots in a still image at a leisurely pace than it is to identify “CAR” “BICYCLE” “RED LIGHT” at 40mph.

One of the more promising object-focused CNNs is “You Only Look Once”, a neat open-source model written mostly in C/C++ and assembled with Python. Here’s a video of “YOLOv3” in action that speaks for itself:

This is wicked good.

My favorite part is 1:35 when it labels the bird “sports ball” and then “dog” before the camera shifts and you can see its beak & side profile. That says quite a lot about how image nets work, really. It’s been fed a ton of images that include birds, dogs and balls, so it knows balls are round, dogs are horizontally rectangular, and birds have beaks.

Bochkovskiy, Wang & Liao released an updated version this April that boasts significantly improved performance with minimal training time; I think their write-up deserves a lot more attention, so I wanted to dive deep into what makes it stand out over the competition. The results of version 4 are hard to ignore; here’s its Average Precision at different framerates using the MS COCO dataset:

[Figure: YOLOv4’s Average Precision at different frame rates on the MS COCO dataset]

Many CNN object detectors are used for slow recommendation systems. When you’re asking “what type of clothing is in this stock photo”, you can afford a net that takes its time to reach maximum precision.

But a great many vision-related tasks, like self-driving cars, require real-time processing. Faster performance on video feeds allows AI to take over more “human” and “reactive” tasks, so we chase it eternally.

The team’s goal for v4 was thus to optimize performance at higher speeds while keeping the model lightweight — ideally, they wanted it to train and run on a single GPU rather than requiring heavier machinery.

This is a tall order; normally there’s some sort of tradeoff between those qualities. But with some clever architectural tricks, the team managed to boost speed and accuracy at the same time while keeping the model light.

Speed vs. Precision

Measuring accuracy in object detection is a bit more complex than in statistical prediction, but the principles are largely the same. Instead of pure precision & recall judgments on True Positives and such, nets like YOLO often use AP50, or “Average Precision at .50 Intersection Over Union”.

[Figure: Intersection over Union illustration]

IoU, or the Jaccard Index, is actually fairly intuitive to look at: if one box is the predicted rabbit and the other is the ground-truth box around the actual rabbit, IoU asks “how many pixels do the two boxes share, out of all the pixels covered by either one?”; in other words, the area of their intersection divided by the area of their union.

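To make that concrete, here’s a minimal sketch of box IoU in plain Python (the coordinates below are made up for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Zero overlap if the boxes don't intersect at all
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction mostly overlapping the ground-truth rabbit box
print(iou((10, 10, 50, 50), (15, 15, 55, 55)))  # ~0.62
```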

Average Precision is a clean way of synthesizing “accuracy”: it can be defined as the area under the precision-recall curve. This streamlined single evaluation metric makes it simpler to compare different model performance.

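A rough sketch of that calculation (COCO’s official metric actually uses 101-point interpolation, but the area-under-the-curve idea is the same; the PR points here are hypothetical):

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve via the trapezoidal
    rule. Points must be sorted by increasing recall."""
    r, p = np.asarray(recalls), np.asarray(precisions)
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

# Hypothetical PR points for one class at IoU >= 0.5 (i.e. AP50)
recall = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
precision = [1.0, 0.95, 0.9, 0.8, 0.6, 0.4]
print(average_precision(recall, precision))  # 0.79
```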

Blueprints and Bags

The YOLOv4 model has several distinct “types” of layers, from bottom to top:

  1. Input: Feeds image into network

  2. Backbone: Extracts feature maps from the image

  3. Neck: Collects feature maps from different layers

  4. Head: Outputs predicted bounding boxes & classes for objects

Selecting the pieces of model architecture is tricky. There’s a lot to balance in the composite task of “recognition”.

Detectors need:

  1. Higher input size (resolution) — to detect multiple smaller objects

  2. More layers for a higher receptive field — to cover the increased input size (bigger image needs more processing)

  3. More parameters — to better detect differently-sized objects in one image

The receptive field, essentially “what the net is looking at within the image and how”, has many size-related considerations. It needs to be big enough to see the entire object, as well as the context around the object to differentiate it from the background.

The best way to boost the receptive field is adding more layers, but casually tacking on extra convolutional layers can inflate the computational time, so some workarounds are required.

The team experiments with two ImageNet-pretrained backbone networks, CSPResNeXt-50 (“Resnext”) and CSPDarknet53 (“Darknet”). The neck consists of PAN and SAM layers (we’ll get to that later), while the head is YOLOv3, in a strangely recursive fashion. Bootstrapping is rarely a bad idea.

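As a mental model, here’s a toy PyTorch skeleton of that pipeline. The stub layers below are placeholders of my own; the real CSPDarknet53, PAN/SAM neck and YOLOv3 head are far larger and emit predictions at multiple scales. Only the backbone-to-neck-to-head composition is the point:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):               # stand-in for CSPDarknet53
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, 3, stride=2, padding=1)
    def forward(self, x):
        return self.conv(x)              # feature maps

class Neck(nn.Module):                   # stand-in for SPP + PAN + SAM
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(32, 64, 1)
    def forward(self, feats):
        return self.conv(feats)          # aggregated features

class Head(nn.Module):                   # stand-in for the YOLOv3 head
    def __init__(self, num_anchors=3, num_classes=80):
        super().__init__()
        # 4 box coords + 1 objectness + class scores, per anchor
        self.pred = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)
    def forward(self, feats):
        return self.pred(feats)          # raw prediction map

model = nn.Sequential(Backbone(), Neck(), Head())
out = model(torch.randn(1, 3, 512, 512))  # one 512x512 RGB image
print(out.shape)  # torch.Size([1, 255, 256, 256])
```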

Bags of Tricks

To optimize their network, the team identified two categories of improvement techniques:

  1. Bag of Freebies: augmentation & other non-layer methods = accuracy gains at no computation/time cost

  2. Bag of Specials: plug-in module layers & post-processing methods = a significant detection boost at some time cost

“Freebies” include some interesting image augmentation and other methods that solve semantic bias & class imbalances in the dataset. I’m a big fan of their new “mosaic” data augmentation process:

The idea behind image augmentation is to randomly create noise in your data, which helps prevent over-training and makes for a more noise-resilient, robust detector overall. This mosaic feature splices parts of 4 images together at irregular crop thresholds to make one composite image that probably shouldn’t exist in reality.

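Here’s a rough NumPy sketch of the mosaic idea. It assumes the four images are already resized to the output size; a real implementation would also remap each image’s bounding boxes into the composite:

```python
import numpy as np

def mosaic(images, out_size=416):
    """Splice crops of 4 same-size images around a random center,
    in the spirit of YOLOv4's mosaic augmentation.
    (Remapping each image's bounding boxes is omitted here.)"""
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    canvas = np.empty((out_size, out_size, 3), dtype=np.uint8)
    canvas[:cy, :cx] = images[0][:cy, :cx]   # top-left crop
    canvas[:cy, cx:] = images[1][:cy, cx:]   # top-right crop
    canvas[cy:, :cx] = images[2][cy:, :cx]   # bottom-left crop
    canvas[cy:, cx:] = images[3][cy:, cx:]   # bottom-right crop
    return canvas

imgs = [np.random.randint(0, 256, (416, 416, 3), dtype=np.uint8)
        for _ in range(4)]
print(mosaic(imgs).shape)  # (416, 416, 3)
```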

Imagine you’re the robot here: The mosaic 1) messes with your perspective and 2) presents you with more diverse object types & sizes in the same image. A net trained with “busier” and more diverse examples should perform better in a busier environment — it will be harder to surprise.

Consider it like this: detecting and classifying pigeons, people and benches is much harder if you suddenly add bears. Most things are, to be fair.

Special Tactics

As part of the “Bag of Specials”, the team introduces a two-stage “Self-Adversarial Training” (SAT) layer: in stage one, the net alters the original image instead of its own weights, creating a “deception” image that appears to lack the desired object. In stage two, the net is told to go ahead and detect something anyway.

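In PyTorch terms, stage one amounts to an FGSM-style perturbation of the input rather than a weight update. The sketch below is my own rough interpretation, with a placeholder loss function and step size:

```python
import torch

def self_adversarial_step(model, image, target, loss_fn, eps=0.01):
    """A rough sketch of the two SAT stages. loss_fn and eps are
    placeholders, not the paper's actual settings."""
    # Stage 1: alter the image, not the weights. Ascending the loss
    # gradient w.r.t. the input (FGSM-style) "hides" the objects the
    # net currently detects, creating the deception image.
    image = image.clone().requires_grad_(True)
    loss_fn(model(image), target).backward()
    with torch.no_grad():
        deception = image + eps * image.grad.sign()
    model.zero_grad()  # discard stage-1 gradients before the real update

    # Stage 2: train the net to detect the objects anyway.
    return loss_fn(model(deception.detach()), target)
```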

This is the deep learning equivalent of your calculus professor tossing in a question on Babylonian religious rites to keep you on your toes. If the network’s getting good at detecting bicycles, then telling it to ignore bicycles in the Tour de France gallery will get it to pay more attention to things it otherwise may have overlooked.

The team also modifies existing modules. The “Spatial Attention Module”, or SAM, is designed to streamline detection: it adds attention without requiring special configuration to choose particular receptive fields/depths. The team changes their SAM from spatial-wise attention to point-wise attention to cut down on calculation time.

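A point-wise SAM can be sketched in a few lines of PyTorch; the channel count and kernel size here are illustrative, not the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class PointwiseSAM(nn.Module):
    """Point-wise attention: a conv + sigmoid produces a mask the
    same shape as the input, skipping SAM's usual max/avg pooling."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        attention = torch.sigmoid(self.conv(x))  # mask, same shape as x
        return x * attention                     # point-wise reweighting

feats = torch.randn(1, 64, 32, 32)
print(PointwiseSAM(64)(feats).shape)  # torch.Size([1, 64, 32, 32])
```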

The “Path Aggregation Network”, PAN, is intended to optimize connections between low and high layers, ensuring the most important information from each feature level is passed along. The team alters PAN’s additive shortcut connection to concatenation, which hands both feature maps to the next layer intact instead of fusing them by addition.

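The difference is easy to see in PyTorch (the shapes below are illustrative):

```python
import torch

low  = torch.randn(1, 256, 52, 52)   # features from a lower layer
high = torch.randn(1, 256, 52, 52)   # upsampled higher-level features

# Original PAN shortcut: element-wise addition (channels unchanged)
fused_add = low + high                     # shape [1, 256, 52, 52]

# YOLOv4's modification: channel concatenation, which passes both
# feature sets along for the next conv layer to weigh
fused_cat = torch.cat([low, high], dim=1)  # shape [1, 512, 52, 52]
print(fused_add.shape, fused_cat.shape)
```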

Tuning & Training

The team did a great deal of work on hyperparameter combinations, using genetic algorithms as well as hand-picked augmentation techniques. Here’s a few examples of augmentation at work for the object classifier:

[Figure: examples of image augmentation applied to the object classifier]

By comparing individual and combined performances of several augmentation techniques, the team figured out that mixing them together produced the best result, boosting top-1 and top-5 accuracy by 1.9% & 1.2% for CSPResNeXt-50 and 1.5% & 1.2% for CSPDarknet-53.

The theory holds up well: a more diverse dataset with more surprises mixed in does better overall. Any extra accuracy without extra time cost is very welcome; keep in mind that 1.5% could be the difference between a truck labeling you “pedestrian” or “turkey”.

For detecting objects, there’s a ton of extra Freebies to be considered:

[Table: detector Bag-of-Freebies ablation results; legend below]

  • S: alters grid sensitivity by multiplying the sigmoid output by a factor greater than 1 (see the sketch after this list)
  • M: Mosaic data augmentation
  • IT: multiple IoU thresholds for anchor boxes (tolerance/strictness)
  • GA: genetic algorithms for hyperparameter tuning during the first 10% of training
  • LS: class label smoothing for sigmoid optimization
  • CBN: Cross-Batch Normalization to learn beyond minibatches
  • CA: cosine annealing scheduler, varying the learning rate along a sinusoid during training
  • DM: dynamic minibatch size, automatically increased during small-resolution, random-shape training
  • OA: optimized anchors for the 512x512 input resolution
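
The “S” freebie is worth unpacking. YOLO decodes a box center as a sigmoid offset from its grid cell’s corner, so a center sitting exactly on a cell border would need an infinitely large logit; stretching the sigmoid by a factor just above 1 fixes that. A small sketch, with an illustrative scale value:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def box_center_x(t_x, c_x, scale=1.0):
    """YOLO box-center decoding for one axis. scale > 1 is the 'S'
    freebie: stretching the sigmoid lets the center actually reach
    the cell borders at c_x and c_x + 1."""
    return scale * sigmoid(t_x) - (scale - 1) / 2 + c_x

# With scale=1, hitting a cell border needs an infinite logit;
# with scale=1.1 (an illustrative value) a modest logit suffices.
for s in (1.0, 1.1):
    print(s, box_center_x(-4.0, c_x=0, scale=s),
             box_center_x(+4.0, c_x=0, scale=s))
```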

The team then compared CSPResNeXt-50 and CSPDarknet53 as backbones. Resnext is better at classification, while Darknet is better at detection. However, they respond differently to BoF & Mish improvements: Resnext gains classification accuracy but loses detection power, while Darknet gains both, so Darknet was chosen as YOLOv4’s final backbone.

They also looked into different minibatch sizes, but these had almost no effect on model accuracy (probably thanks to the comprehensive image augmentation).

Conclusion: Significant Progress

There’s nothing better than a big, vindicating graph to show off your hard work:

[Figure: YOLOv4 comparison against leading detection/classification networks. Y axes are AP and AP50, X axes are FPS on different GPUs.]

Note the “real-time” area of >30 FPS marked in blue: these are the nets applicable to detecting objects in motion. It’s clear to see that YOLOv4 is the king of this category.

And to top it off, this performance was achieved with relatively normal hardware, training and running on a single industry-standard GPU. They’ve certainly achieved their stated goal of making a quick, accurate object-prediction net available for personal use.

What makes this case interesting is the careful consideration of many architectural variables. In their paper, the team does a great job of detailing why existing recognition nets are good and bad in certain areas, and how they altered existing modules to keep computation light while extracting more features at a faster pace.

I highly recommend checking out their GitHub; they make it relatively easy to install dependencies and get it running with your own videos. I’ve been digging into their code to see if I can plug in one of those Ghost modules I wrote about to cut runtime even further.

Source: https://medium.com/@mark.s.cleverley/yolov4-the-subtleties-of-high-speed-object-detection-91eced12c6a4
