Segmentation and Object Detection - Part 4

Detection and Semantic Segmentation

FAU Lecture Notes on Deep Learning

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video & matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created with deep learning techniques largely automatically and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!


Navigation

Previous Lecture / Watch this Video / Top Level / Next Lecture


Today’s topic is real-time object detection in complex scenes. Image created using gifify. Source: YouTube

Welcome back to deep learning! So today, we want to discuss the single-shot detectors and how we can actually approach real-time object detection.


Image under CC BY 4.0 from the Deep Learning Lecture.

Okay, this is the fourth part of segmentation and object detection: the single-shot detectors. So, can’t we just use the region proposal network as a detector in a “you only look once” fashion? This is exactly the idea of YOLO, which is a single-shot detector. You only look once: you combine the bounding box prediction and the classification into a single network.


Image under CC BY 4.0 from the Deep Learning Lecture.

This is done by subdividing the image essentially into S × S cells. For every cell, you compute the class probability map and, in parallel, produce B bounding boxes with confidence scores. So the CNN predicts S × S × (5B + C) values, where C is the number of classes and each box contributes four coordinates plus one confidence. In the end, to produce the final object detection, you compute the overlap of each bounding box with the respective class probability map. Averaging within the bounding box then yields the final class of the respective object. This way you are able to solve complex scenes like this one, and it really runs in real-time.

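To make the output layout concrete, here is a minimal decoding sketch in Python. It assumes the configuration of the original YOLO paper for Pascal VOC (S = 7, B = 2, C = 20), uses random numbers in place of a trained network’s output, and the 0.25 score threshold is an arbitrary illustrative choice.

```python
import numpy as np

# Illustrative YOLO configuration (original Pascal VOC setup): S = 7 grid
# cells per side, B = 2 boxes per cell, C = 20 classes.
S, B, C = 7, 2, 20

# The CNN emits S x S x (5B + C) values: per cell, B boxes with
# (x, y, w, h, confidence) plus one shared set of C class probabilities.
prediction = np.random.rand(S, S, 5 * B + C)   # stand-in for a real network output
print(S * S * (5 * B + C))                     # 7 * 7 * 30 = 1470 values per image

detections = []
for row in range(S):
    for col in range(S):
        cell = prediction[row, col]
        class_probs = cell[5 * B:]             # the C class probabilities
        for b in range(B):
            x, y, w, h, conf = cell[5 * b: 5 * b + 5]
            scores = conf * class_probs        # class-specific confidence
            best = int(np.argmax(scores))
            if scores[best] > 0.25:            # arbitrary threshold for the sketch
                detections.append((row, col, x, y, w, h, best, float(scores[best])))
```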

YOLO9000 specifications. Image under CC BY 4.0 from the Deep Learning Lecture.

So, there’s YOLO9000, which is an improved version of YOLO advertised as better, faster, and stronger. It’s better because batch normalization is used. They also do high-resolution classification, which improves the mean average precision by up to 6%. Anchor boxes found by clustering over the training data improve the recall by 7%. Training over multiple scales allows YOLO9000 to detect objects at different resolutions more easily. It’s faster because it uses a different CNN architecture that speeds up the forward pass. Finally, it’s stronger because it performs hierarchical detection on a tree, which allows combining different object detection datasets. All of this allows YOLO9000 to detect up to 9,000 classes in real-time or faster.

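The anchor shapes mentioned above are found by running k-means over the widths and heights of the training boxes, using 1 - IoU as the distance so that large boxes do not dominate the clustering. Below is a rough sketch of this dimension-clustering idea; the function names and the default k = 5 are illustrative assumptions, not YOLO9000’s actual code.

```python
import numpy as np

def iou_wh(wh, centroids):
    # IoU between one (w, h) box shape and k centroid shapes, all anchored
    # at the origin so that only width and height matter.
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100):
    # k-means over training box shapes with 1 - IoU as the distance measure.
    centroids = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU (lowest distance).
        assign = np.array([np.argmax(iou_wh(b, centroids)) for b in boxes_wh])
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Hypothetical usage on an (N, 2) array of training box widths and heights:
# anchors = kmeans_anchors(train_boxes_wh, k=5)
```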

YOLO9000 in action. Image created using gifify. Source: YouTube

There is also the single-shot multi-box detector (SSD) in [24]. It’s a popular alternative to YOLO. Like YOLO, it is a single-shot detector with only one forward pass through the CNN.


Image under CC BY 4.0 from the Deep Learning Lecture.

It’s called multi-box because this is the name of the bounding box regression technique in [15], and it’s obviously an object detector. It differs from YOLO in several aspects, for example by placing default boxes of several aspect ratios on feature maps of multiple scales (see the sketch below), but shares the same core idea.

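To get a feeling for how many hypotheses such a detector evaluates in its single pass, here is a back-of-the-envelope count assuming the SSD300 configuration from the paper: six feature maps of decreasing resolution, each location carrying four or six default boxes of different aspect ratios.

```python
# Default-box count for the assumed SSD300 configuration.
feature_map_sizes  = [38, 19, 10, 5, 3, 1]   # spatial resolution per feature map
boxes_per_location = [4, 6, 6, 6, 4, 4]      # default boxes at each location

total = sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_location))
print(total)  # 8732 default boxes scored in one forward pass
```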

Image under CC BY 4.0 from the Deep Learning Lecture.

Now, you may still have a problem with multiple resolutions, in particular if you think about tasks like histological images that have a very, very high resolution. Then, you can also work with detectors like RetinaNet. It essentially uses a ResNet CNN as encoder/decoder, very similar to what we’ve already seen in image segmentation. It uses a feature pyramid net that couples the feature maps produced by the encoder with the upsampled maps generated by the decoder. So you could say it’s very similar to a U-net. In contrast to U-net, it does a class and a box prediction using a subnet on each of the scales of the feature pyramid net. So, you could say it’s a single-shot detector that uses a U-net-style pyramid simultaneously for the class and box prediction. Also, it uses the focal loss that we will talk about in a couple of slides.

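A minimal sketch of those per-scale subnets might look as follows in PyTorch. The channel, anchor, and class counts follow the numbers reported for RetinaNet (256 channels, 9 anchors per location, e.g. 80 COCO classes), but the class `RetinaHead` and its reduced structure are illustrative assumptions rather than the reference implementation.

```python
import torch.nn as nn

class RetinaHead(nn.Module):
    # Two small convolutional subnets, one for classification and one for box
    # regression, shared across and applied to every level of the feature pyramid.
    def __init__(self, channels=256, num_anchors=9, num_classes=80):
        super().__init__()

        def subnet(out_channels):
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
            layers.append(nn.Conv2d(channels, out_channels, 3, padding=1))
            return nn.Sequential(*layers)

        self.cls_subnet = subnet(num_anchors * num_classes)  # per-anchor class scores
        self.box_subnet = subnet(num_anchors * 4)            # per-anchor box offsets

    def forward(self, pyramid_features):
        # The same two subnets are reused on every pyramid level.
        return [(self.cls_subnet(p), self.box_subnet(p)) for p in pyramid_features]
```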

Image for post
CC BY 4.0 from the 深度学习讲座中 Deep Learning Lecture. CC BY 4.0下的图像。

Let’s look a bit at the tradeoff between speed and accuracy. You can see that, generally, networks that are very accurate are not so fast. Here, you see the GPU time on the x-axis and the overall mean average precision on the y-axis. You can combine architectures like single-shot detectors and R-CNN, or ideas like Faster R-CNN, with different feature extractors like Inception-ResNet, Inception, and so on. This allows us to produce many different combinations. You can see that if you spend more time on the computation, then you typically can also increase the accuracy, and this is reflected in this graph.


Image under CC BY 4.0 from the Deep Learning Lecture.

Class imbalance is key to tackling the speed-accuracy tradeoff. All of these single-shot detectors evaluate many hypothesis locations, and most of them are really easy negatives. This imbalance is not addressed by the standard training. In classical methods, we typically dealt with it by hard-negative mining. Now, the question is: “Can we change the loss function to pay less attention to easy examples?”


Image under CC BY 4.0 from the Deep Learning Lecture.

This idea brings us exactly to the focal loss. Here, we can essentially treat the objectness, whether something is an object or not, as binary and model it with a Bernoulli distribution. The usual loss would simply be the cross-entropy, i.e. the negative logarithm of the probability of the correct class. We can now adjust this to the so-called focal loss FL(p_t) = -α_t (1 - p_t)^γ log(p_t). Here, we introduce an additional parameter α, the imbalance weight calculated as the inverse class frequency. Additionally, we introduce a hyper-parameter γ that decreases the influence of easy examples. You can see the influence of γ in the plot on the left-hand side: the more you increase γ, the more peaked the respective weight becomes, so that training can really concentrate on the hard, infrequent cases.

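Written out, the binary focal loss only takes a few lines. The sketch below assumes PyTorch; α = 0.25 and γ = 2 are the values reported to work well in the focal loss paper, while α can in general be derived from the inverse class frequency as described above.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t): the (1 - p_t)**gamma
    # factor shrinks the loss of well-classified (easy) examples.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# With gamma = 0 and alpha = 0.5 this reduces (up to scale) to plain binary
# cross-entropy; larger gamma focuses the training on hard examples.
```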

Image under CC BY 4.0 from the Deep Learning Lecture.

So, let’s summarize object detection. The main task is detecting bounding boxes and the associated classification. The sliding window approach is extremely inefficient. Region proposal networks reduce the number of candidates, but if you really want to go towards real-time, then you have to use single-shot detectors like YOLO to avoid additional steps. Object detector concepts can, of course, be combined with arbitrary feature extraction and classification networks, as we’ve seen earlier. Also, keep in mind the speed-accuracy tradeoff: if you want to be very quick, you of course reduce the number of predicted bounding boxes, which makes you much faster, but then you may miss true positives.


Image under CC BY 4.0 from the Deep Learning Lecture.

So, we have now discussed segmentation and object detection, including how to do object detection very quickly. Next time, we will look into the fusion of both, which is going to be instance segmentation. Thank you very much for watching this video, and I’m looking forward to seeing you in the next one.


The YOLO COCO object detector in action. Image created using gifify. Source: YouTube

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.


Translated from: https://towardsdatascience.com/segmentation-and-object-detection-part-4-f1d0d213976b
