The Basic Principles of Deep Learning Algorithms for Object Detection

You just got a new drone and you want it to be super smart! Maybe it should detect whether workers are properly wearing their helmets or how big the cracks on a factory rooftop are.

In this blog post, we’ll look at the basic methods of object detection (Exhaustive Search, R-CNN, Fast R-CNN and Faster R-CNN) and try to understand the technical details of each model. The best part? We’ll do all of this without any formulas, so readers at all levels of experience can follow along!

Finally, we will follow this post with a second one, where we will take a deeper dive into Single Shot Detector (SSD) networks and see how this can be deployed… on a drone.

Our First Steps Into Object Detection

Is It a Bird? Is It a Plane? Image Classification

Object detection (or recognition) builds on image classification. Image classification is the task of, you guessed it, classifying an image (represented as a grid of pixels) into a class category. For a refresher on image classification, we refer the reader to this post.

Object recognition is the process of identifying and classifying objects inside an image, which looks something like this:

For the model to learn both the class and the position of the object in the image, the target has to be a five-dimensional label (class, x, y, width, height).

The Inner Workings of Object Detection Methods

A Computationally Expensive Method: Exhaustive Search

The simplest object detection method is using an image classifier on various subparts of the image. Which ones, you might ask? Let’s consider each of them:

1. First, take the image on which you want to perform object detection.

baseball players sliding into each other

2. Then, divide this image into different sections, or “regions”, as shown below:

Dividing the image into regions for analysis.

3. Consider each region as an individual image.

4. Classify each image using a classic image classifier.

5. Finally, combine the results: keep every region in which an object has been detected, together with its predicted label.

Each region of the image receives a label.

One problem with this method is that objects can have different aspect ratios and spatial locations, so a large number of regions must be classified, which is unnecessarily expensive. The computation time is too much of a bottleneck for the method to be used on real-life problems.

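To get a feel for why exhaustive search is expensive, here is a minimal sketch that enumerates candidate windows over a few sizes and aspect ratios. The window sizes, ratios, stride, and the 1,000 x 600 image size are illustrative assumptions, not values prescribed by the method; every window produced would need its own pass through the image classifier.

```python
from itertools import product


def sliding_windows(img_w, img_h, sizes=(64, 128, 256),
                    ratios=(0.5, 1.0, 2.0), stride=32):
    """Yield (x, y, w, h) for every window that fits inside the image.

    `sizes`, `ratios` (width / height) and `stride` are hypothetical
    settings chosen only for illustration.
    """
    for size, ratio in product(sizes, ratios):
        # Keep the window area close to size * size while changing its shape.
        w = int(round(size * ratio ** 0.5))
        h = int(round(size / ratio ** 0.5))
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


windows = list(sliding_windows(1000, 600))
print(f"{len(windows)} regions to classify for a single image")
```

Halving the stride roughly quadruples the number of windows, which is why this approach does not scale.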

Region Proposal Methods and Selective Search

A more recent approach is to break down the problem into two tasks: detect the areas of interest first and then perform image classification to determine the category of each object.

The first step usually consists in applying region proposal methods. These methods output bounding boxes that are likely to contain objects of interest. If the object has been properly detected in one of the region proposals, then the classifier should detect it as well. That’s why it’s important for these methods to not only be fast, but also to have a very high recall.

These methods also use a clever architecture where part of the image preprocessing is the same for the object detection and for the classification tasks, making them faster than simply chaining two algorithms. One of the most frequently used region proposal methods is selective search:

Its first step is to apply image segmentation, as shown here:

From the image segmentation output, selective search will successively:

1. Create bounding boxes from the segmented parts and add them to the list of region proposals.
2. Combine several small adjacent segments into larger ones based on four types of similarity: color, texture, size, and shape.
3. Go back to step one until the section covers the entire image.
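
The loop above can be sketched with a toy example. This is a heavily simplified illustration, not the actual selective search algorithm: each segment is reduced to a bounding box plus a single mean-color number, similarity uses color only (texture, size, and shape are ignored), two segments count as adjacent when their boxes touch, and a merged segment simply averages the two colors. All names are hypothetical.

```python
def merge_boxes(a, b):
    """Smallest box (x_min, y_min, x_max, y_max) enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def touches(a, b):
    """True when two boxes overlap or share a border (toy adjacency test)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def hierarchical_grouping(segments):
    """segments: list of (box, mean_color). Returns all proposed boxes."""
    regions = list(segments)
    proposals = [box for box, _ in regions]      # step 1: initial boxes
    while len(regions) > 1:
        # Step 2: find the most similar pair of adjacent regions (color only).
        pairs = [(abs(regions[i][1] - regions[j][1]), i, j)
                 for i in range(len(regions))
                 for j in range(i + 1, len(regions))
                 if touches(regions[i][0], regions[j][0])]
        if not pairs:
            break
        _, i, j = min(pairs)
        (box_i, color_i), (box_j, color_j) = regions[i], regions[j]
        merged = (merge_boxes(box_i, box_j), (color_i + color_j) / 2)
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged[0])              # step 3: repeat with new box
    return proposals


# Four quadrants of a 20 x 20 image: two dark ones on top, two bright below.
segments = [((0, 0, 10, 10), 0.10), ((10, 0, 20, 10), 0.15),
            ((0, 10, 10, 20), 0.80), ((10, 10, 20, 20), 0.90)]
proposals = hierarchical_grouping(segments)
for box in proposals:
    print(box)
```

In this toy run the two dark quadrants merge first, then the two bright ones, and the last proposal covers the whole image.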

Hierarchical grouping

Now that we understand how selective search works, let’s introduce some of the most popular object detection algorithms that leverage it.

A First Object Detection Algorithm: R-CNN

Ross Girshick et al. proposed Region-CNN (R-CNN), which combines selective search with CNNs. For each region proposal (about 2,000 in the paper), one forward pass through a CNN generates an output vector. This vector is fed to a one-vs-all classifier, i.e., one classifier per class: for instance, one classifier whose label is 1 if the image is a dog and 0 if not, a second one whose label is 1 if the image is a cat and 0 if not, and so on. The classification algorithm used by R-CNN is an SVM.

But how do you label the region proposals? Of course, if it perfectly matches our ground truth we can label it as 1, and if a given object is not present at all, we can then label it 0 for this object. What if a part of an object is present in the image? Should we label the region as 0 or 1? To make sure we are training our classifier on regions that we can realistically have when predicting an image (and not only perfectly matching regions), we are going to look at the intersection over union (IoU) of the boxes predicted by the selective search and the ground truth:

The IoU is a metric represented by the area of overlap between the predicted and the ground truth boxes divided by their area of union. It rewards successful pixel detection and penalizes false positives in order to prevent algorithms from selecting the whole image.

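As a minimal sketch of the metric just described, the function below computes the IoU of two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box format and the function name are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
print(iou(ground_truth, (5, 0, 15, 10)))    # half-overlapping prediction
print(iou(ground_truth, (0, 0, 100, 100)))  # "select the whole image" prediction
```

The second call shows the penalty mentioned above: a box covering the whole image contains the object entirely, yet its large union drives the score down to 0.01.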

Going back to our R-CNN method, if the IoU is lower than a given threshold (0.3), then the associated label would be 0.

After running the classifier on all region proposals, R-CNN proposes to refine the bounding box (bbox) using a class-specific bbox regressor. The bbox regressor can fine-tune the position of the bounding box boundaries. For example, if the selective search has detected a dog but only selected half of it, the bbox regressor, which is aware that dogs have four legs, will ensure that the whole body is selected.

Also thanks to the new bbox regressor prediction, we can discard overlapping proposals using non-maximum suppression (NMS). Here, the idea is to identify and delete overlapping boxes of the same object. NMS sorts the proposals per classification score for each class and computes the IoU of the predicted boxes with the highest probability score with all the other predicted boxes (of the same class). It then discards the proposals if the IoU is higher than a given threshold (e.g., 0.5). This step is then repeated for the next best probabilities.

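Here is a minimal sketch of non-maximum suppression for the boxes of a single class, following the description above. The box format (x_min, y_min, x_max, y_max), the function names, and the sample boxes are assumptions made for this example; the 0.5 threshold is the example value from the text.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, best score first."""
    # Sort candidate indices by decreasing classification score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is discarded
```

In practice this would be run once per class, since boxes of different classes should not suppress each other.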

To sum up, R-CNN follows the following steps:

1. Create region proposals from selective search (i.e., predict the parts of the image that are likely to contain an object).
2. Run these regions through a pre-trained model and then an SVM to classify the sub-image.
3. Run the positive predictions through a bounding box regressor, which improves box accuracy.
4. Apply NMS at prediction time to get rid of overlapping proposals.

There are, however, some issues with R-CNN:

This method still needs to classify all the region proposals, which can lead to computational bottlenecks: it is not possible to use it for a real-time use case.
No learning happens at the selective search stage, which can lead to bad region proposals for certain types of datasets.

A Marginal Improvement: Fast R-CNN

Fast R-CNN — as its name indicates — is faster than R-CNN. It is based on R-CNN with two differences:

Instead of feeding the CNN every region proposal, you feed the CNN only once with the whole image to generate a convolutional feature map (take a vector of pixels and transform it into another vector using a filter, which gives you a convolutional feature map; you can find more info here). Next, the region proposals are identified with selective search and then reshaped to a fixed size using a Region of Interest pooling (RoI pooling) layer, so that they can be used as input to the fully connected layer.
Fast R-CNN uses a softmax layer instead of an SVM to classify region proposals, which is faster and more accurate.
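
To illustrate how RoI pooling turns regions of different shapes into a fixed-size output, here is a minimal single-channel sketch in plain Python. It assumes the region is already expressed in feature-map cells as (x_min, y_min, x_max, y_max) with exclusive upper bounds, and that it is at least as large as the output grid; the bin-splitting rule is one simple choice among several, and the function name is hypothetical.

```python
def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a 2-D feature map into a fixed-size grid."""
    x_min, y_min, x_max, y_max = roi
    region = [row[x_min:x_max] for row in feature_map[y_min:y_max]]
    height, width = len(region), len(region[0])
    out_h, out_w = output_size
    # Split the region into out_h x out_w bins of (nearly) equal size.
    y_edges = [height * i // out_h for i in range(out_h + 1)]
    x_edges = [width * j // out_w for j in range(out_w + 1)]
    return [
        [max(region[y][x]
             for y in range(y_edges[i], y_edges[i + 1])
             for x in range(x_edges[j], x_edges[j + 1]))
         for j in range(out_w)]
        for i in range(out_h)
    ]


# A 6 x 6 feature map whose cells hold 0..35, row by row.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # 4 x 4 region -> 2 x 2
print(roi_max_pool(feature_map, (1, 2, 6, 5)))  # 3 x 5 region -> 2 x 2
```

Both regions come out as 2 x 2 grids even though their shapes differ, which is what lets a fully connected layer with a fixed input size consume them.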

Here is the architecture of the network:

As we can see in the figure below, Fast R-CNN is way faster at training and testing than R-CNN. However, a bottleneck still remains due to the selective search method.

How Fast Can R-CNN Get? Faster R-CNN

While Fast R-CNN was a lot faster than R-CNN, the bottleneck remains with selective search as it is very time consuming. Therefore, Shaoqing Ren et al. came up with Faster R-CNN to solve this and proposed to replace selective search by a very small convolutional network called Region Proposal Network (RPN) to find the regions of interest.

In a nutshell, RPN is a small network that directly finds region proposals.

One naive approach would be to create a deep learning model that outputs x_min, y_min, x_max, and y_max to get the bounding box for one region proposal (so 8,000 outputs if we want 2,000 regions). However, there are two fundamental problems:

The images can have very different sizes and ratios, so creating a model that correctly predicts raw coordinates can be tricky.
There are some coordinate ordering constraints in our prediction (x_min < x_max, y_min < y_max).

To overcome this, we are going to use anchors:

Anchors are predefined boxes of different ratios and scales, placed all over the image. For example, for a given central point, we usually start with three sizes (e.g., 64px, 128px, 256px) and three different width/height ratios (1/1, 1/2, 2/1). In this example, we would end up with nine different boxes for a given pixel of the image (the center of our boxes).

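The nine boxes for one center point can be generated with a short sketch like the one below. It assumes that each "size" is the side of a square of equal area (so a 128px anchor with ratio 2/1 is wider and flatter but covers about 128 x 128 pixels); other conventions exist, and the function name is hypothetical.

```python
def make_anchors(cx, cy, sizes=(64, 128, 256), ratios=(1.0, 0.5, 2.0)):
    """Return 9 anchors (x_min, y_min, x_max, y_max) centered on (cx, cy).

    `ratios` are width / height. Assumption: the area of every anchor
    stays equal to size * size whatever its ratio.
    """
    anchors = []
    for size in sizes:
        for ratio in ratios:
            w = size * ratio ** 0.5
            h = size / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 200)
for x_min, y_min, x_max, y_max in anchors:
    print(f"w={x_max - x_min:6.1f}  h={y_max - y_min:6.1f}")
```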

So how many anchors would I have in total for one image?

It is paramount to understand that we are not going to create anchors on the raw image, but on the feature map output by the last convolutional layer. For instance, it’s false to say that for a 1,000*600 input image we would have nine anchors per pixel, so 1,000*600*9 = 5,400,000 anchors. Indeed, since we are going to create them on the feature map, there is a subsampling ratio to take into account (the reduction factor between the input and output dimensions due to strides in the convolutional layers).

In our example, if we take this ratio to be 16 (like in VGG16) we would have nine anchors per spatial position of the feature map so “only” around 20,000 anchors (5,400,000 / 16²). This means that two consecutive pixels in the output features correspond to two points which are 16 pixels apart in the input image. Note that this down sampling ratio is a tunable parameter of Faster R-CNN.

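The arithmetic behind these numbers can be checked in a few lines. The floor division used for the feature-map size is an assumption; the exact size depends on the padding used by the network.

```python
img_w, img_h = 1000, 600
anchors_per_position = 9
subsampling_ratio = 16  # as in VGG16, per the text

# The (wrong) count if anchors were placed on every pixel of the raw image.
naive_count = img_w * img_h * anchors_per_position

# Anchors are placed on the feature map, which is 16 times smaller per side.
feat_w = img_w // subsampling_ratio
feat_h = img_h // subsampling_ratio
anchor_count = feat_w * feat_h * anchors_per_position

print(naive_count)     # 5400000
print(feat_w, feat_h)  # 62 37
print(anchor_count)    # 20646, i.e. "around 20,000"
```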

The remaining question now is how to go from those 20,000 anchors to 2,000 region proposals (taking the same number of region proposals as before), which is the goal of our RPN.

How to Train the Region Proposal Network

To achieve this, we want our RPN to tell us whether a box contains an object or is background, as well as the accurate coordinates of the object. The output predictions are the probability of being background, the probability of being foreground, and the deltas Dx, Dy, Dw, Dh (the differences between the anchor and the final proposal).

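To make the deltas concrete, here is a sketch of how a proposal could be recovered from an anchor and its four deltas. The text only says that the deltas are differences between anchor and proposal; the exact encoding below (center shifts scaled by the anchor size, and log-scale factors for width and height) is an assumption, chosen because it is the common parameterization for this kind of regressor.

```python
import math


def apply_deltas(anchor, deltas):
    """Turn an anchor (x_min, y_min, x_max, y_max) plus (dx, dy, dw, dh) into a box."""
    ax_min, ay_min, ax_max, ay_max = anchor
    a_w, a_h = ax_max - ax_min, ay_max - ay_min
    a_cx, a_cy = ax_min + a_w / 2, ay_min + a_h / 2
    dx, dy, dw, dh = deltas
    # Assumed encoding: shift the center by a fraction of the anchor size...
    cx = a_cx + dx * a_w
    cy = a_cy + dy * a_h
    # ...and rescale width and height by an exponential factor.
    w = a_w * math.exp(dw)
    h = a_h * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchor = (0.0, 0.0, 10.0, 20.0)
print(apply_deltas(anchor, (0.0, 0.0, 0.0, 0.0)))  # unchanged anchor
print(apply_deltas(anchor, (0.5, 0.0, 0.0, 0.0)))  # shifted right by half a width
```

One side effect of this encoding is that the width and height are always positive, so the ordering constraints (x_min < x_max, y_min < y_max) hold by construction.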

1. First, remove the cross-boundary anchors (i.e., the anchors that are cut off by the border of the image). This leaves us with around 6,000 anchors.
2. Label an anchor positive if either of the following two conditions holds:
   → The anchor has the highest IoU with a ground truth box among all the anchors.
   → The anchor has an IoU of at least 0.7 with a ground truth box.
3. Label an anchor negative if its IoU is less than 0.3 with every ground truth box.
4. Disregard all the remaining anchors.
5. Train the binary classification and the bounding box regression adjustment.
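
The labeling rules above can be sketched as follows, using 1 for positive, 0 for negative, and -1 for disregarded anchors. The label encoding, the function names, and the sample boxes are assumptions made for this example; the 0.7 and 0.3 thresholds come from the text.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 disregarded."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        if best >= pos_thr:
            labels.append(1)     # overlaps some ground truth box strongly
        elif best < neg_thr:
            labels.append(0)     # overlaps no ground truth box significantly
        else:
            labels.append(-1)    # in between: not used for training
    # Every ground truth box also makes its best-matching anchor positive.
    for g in range(len(gt_boxes)):
        best_anchor = max(range(len(anchors)), key=lambda a: ious[a][g])
        if ious[best_anchor][g] > 0:
            labels[best_anchor] = 1
    return labels


gt_boxes = [(0, 0, 10, 10), (40, 40, 50, 50)]
anchors = [(0, 0, 10, 10),     # perfect match with the first object
           (0, 0, 10, 5),      # IoU 0.5: neither positive nor negative
           (20, 20, 30, 30),   # no overlap at all
           (42, 42, 52, 52)]   # IoU below 0.7, but best match for the second object
labels = label_anchors(anchors, gt_boxes)
print(labels)
```

The last anchor shows why the first condition exists: without it, the second object would have no positive anchor to learn from.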

Finally, a few remarks about the implementation:

We want the number of positive and negative anchors to be balanced in our mini-batch.
We use a multi-task loss, which makes sense since we want to minimize both losses: the error of mistakenly predicting foreground or background, and the error in the accuracy of our box.
We initialize the convolutional layers using weights from a pre-trained model.
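
A multi-task loss of this kind can be sketched as the sum of a classification term and a box term. The specific choices below (binary cross-entropy for foreground versus background, a smooth L1 penalty on the deltas of positive anchors only, plain averaging, and a `box_weight` balancing factor) are assumptions for illustration, not a statement of how the original model normalizes its loss.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def rpn_loss(fg_probs, labels, pred_deltas, target_deltas, box_weight=1.0):
    """labels: 1 positive, 0 negative, -1 disregarded (hypothetical encoding)."""
    cls_terms, box_terms = [], []
    for p, label, pred, target in zip(fg_probs, labels, pred_deltas, target_deltas):
        if label == -1:
            continue                      # disregarded anchors add nothing
        p = min(max(p, 1e-7), 1 - 1e-7)   # avoid log(0)
        cls_terms.append(-math.log(p) if label == 1 else -math.log(1 - p))
        if label == 1:                    # box error only matters for objects
            box_terms.append(sum(smooth_l1(a - b) for a, b in zip(pred, target)))
    cls_loss = sum(cls_terms) / max(len(cls_terms), 1)
    box_loss = sum(box_terms) / max(len(box_terms), 1)
    return cls_loss + box_weight * box_loss


labels = [1, 0, -1]
targets = [(0.1, 0.2, 0.0, 0.0)] * 3
good = rpn_loss([0.9, 0.1, 0.5], labels, targets, targets)
bad = rpn_loss([0.4, 0.6, 0.5], labels, [(1.0, 1.0, 1.0, 1.0)] * 3, targets)
print(good, bad)
```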

How to Use the Region Proposal Network

1. All the anchors (20,000) are scored, so we get new bounding boxes and the probability of being foreground (i.e., being an object) for all of them.
2. Non-maximum suppression is applied (see the R-CNN section).
3. Proposal selection: finally, only the top N proposals sorted by score are kept (with N = 2,000, we are back to our 2,000 region proposals).
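
Steps 2 and 3 can be sketched together: walk through the scored boxes from best to worst, skip any box that overlaps an already kept box too much, and stop once N boxes are kept. The 0.7 overlap threshold, the function names, and the tiny sample data are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, fg_scores, iou_threshold=0.7, top_n=2000):
    """Greedy NMS followed by a top-N cut, returning the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: fg_scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == top_n:
            break
    return [boxes[i] for i in kept]


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30), (40, 40, 50, 50)]
fg_scores = [0.6, 0.9, 0.8, 0.3]
print(select_proposals(boxes, fg_scores, top_n=2))
print(select_proposals(boxes, fg_scores, top_n=10))
```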

We finally have our 2,000 proposals like in the previous methods. Despite appearing more complex, this prediction step is way faster and more accurate than the previous methods.

The next step is to create a similar model as in Fast R-CNN (i.e. RoI pooling, and a classifier + bbox regressor), using RPN instead of selective search. However, we don’t want to do exactly as before, i.e. take the 2,000 proposals, crop them, and pass them through a pre-trained base network. Instead, reuse the existing convolutional feature map. Indeed, one of the advantages of using an RPN as a proposal generator is to share the weights and CNN between the RPN and the main detector network.

1. The RPN is trained using a pre-trained network and then fine-tuned.
2. The detector network is trained using a pre-trained network and then fine-tuned. Proposal regions from the RPN are used.
3. The RPN is initialized using the weights from the second model and then fine-tuned (this is going to be our final RPN model).
4. Finally, the detector network is fine-tuned (RPN weights are fixed). The CNN feature maps are going to be shared between the two networks (see next figure).

Region proposal network example — Faster R-CNN network

To sum up, Faster R-CNN is more accurate than the previous methods and is about 10 times faster than Fast R-CNN, which is a big improvement and a start for real-time scoring.

Even still, region proposal detection models won’t be enough for an embedded system since these models are heavy and not fast enough for most real-time scoring cases — the last example is about five images per second.

In our next post, we will discuss faster methods like SSD and real use cases with image detection from drones.

We’re excited to be working on this topic for Dataiku DSS — check out the additional resources below to learn more:

Object Detection plugin for Dataiku DSS projects
A NATO challenge we won at Dataiku using an object detection algorithm