How to implement a YOLO (v3) object detector from scratch in PyTorch: Part 1


Tutorial on building a YOLO v3 detector from scratch, detailing how to create the network architecture from a configuration file, load the weights, and design the input/output pipelines.


Object detection is a domain that has benefited immensely from the recent developments in deep learning. Recent years have seen people develop many algorithms for object detection, some of which include YOLO, SSD, Mask RCNN and RetinaNet.

For the past few months, I’ve been working on improving object detection at a research lab. One of the biggest takeaways from this experience has been realizing that the best way to go about learning object detection is to implement the algorithms by yourself, from scratch. This is exactly what we’ll do in this tutorial.

We will use PyTorch to implement an object detector based on YOLO v3, one of the faster object detection algorithms out there.

The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in its entirety at this GitHub repo:
https://github.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch
https://github.com/ayooshkathuria/pytorch-yolo-v3

This tutorial is broken into 5 parts:
Part 1 (this one): Understanding how YOLO works
Part 2: Creating the layers of the network architecture
Part 3: Implementing the forward pass of the network
Part 4: Objectness score thresholding and non-maximum suppression
Part 5: Designing the input and the output pipelines

1. Prerequisites

You should understand how convolutional neural networks work. This also includes knowledge of Residual Blocks, skip connections, and Upsampling.

What is object detection, bounding box regression, IoU and non-maximum suppression.

Basic PyTorch usage. You should be able to create simple neural networks with ease.

I’ve provided links at the end of the post in case you fall short on any front.

2. What is YOLO?

YOLO stands for You Only Look Once. It’s an object detector that uses features learned by a deep convolutional neural network to detect an object. Before we get our hands dirty with code, we must understand how YOLO works.

2.1 A Fully Convolutional Neural Network

YOLO makes use of only convolutional layers, making it a fully convolutional network (FCN). It has 75 convolutional layers, with skip connections and upsampling layers. No form of pooling is used, and a convolutional layer with stride 2 is used to downsample the feature maps. This helps in preventing loss of low-level features often attributed to pooling.

Being a FCN, YOLO is invariant to the size of the input image. However, in practice, we might want to stick to a constant input size due to various problems that only show their heads when we are implementing the algorithm.

A big one amongst these problems is that if we want to process our images in batches (images in batches can be processed in parallel by the GPU, leading to speed boosts), we need to have all images of fixed height and width. This is needed to concatenate multiple images into a large batch (concatenating many PyTorch tensors into one).
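
As a quick illustration of why this matters, here is a minimal sketch (the tensors are random stand-ins for images that have already been resized to 416 × 416):

```python
import torch

# Three images of identical spatial size can be stacked into a single
# batch of shape (3, 3, 416, 416) and processed by the GPU in parallel.
imgs = [torch.rand(3, 416, 416) for _ in range(3)]
batch = torch.stack(imgs)
print(batch.shape)  # torch.Size([3, 3, 416, 416])

# Stacking tensors of mismatched sizes raises a RuntimeError, which is
# why a constant input size is enforced in practice:
# torch.stack([torch.rand(3, 416, 416), torch.rand(3, 320, 320)])
```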

The network downsamples the image by a factor called the stride of the network. For example, if the stride of the network is 32, then an input image of size 416 × 416 will yield an output of size 13 × 13. Generally, the stride of any layer in the network is equal to the factor by which the output of the layer is smaller than the input image to the network.

416 / 32 = 416 / 2^5 = 13
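
To make the arithmetic concrete, here is a small sketch (not YOLO’s actual layers, and the channel counts are arbitrary) in which five stride-2 convolutions give a total stride of 2^5 = 32:

```python
import torch
import torch.nn as nn

# Each conv halves the spatial dimensions; five of them downsample by 32.
downsample = nn.Sequential(
    *[nn.Conv2d(3 if i == 0 else 16, 16, kernel_size=3, stride=2, padding=1)
      for i in range(5)]
)

x = torch.rand(1, 3, 416, 416)
print(downsample(x).shape)  # torch.Size([1, 16, 13, 13]) -> a 13 x 13 map
```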

2.2 Interpreting the output

Typically (as is the case for all object detectors), the features learned by the convolutional layers are passed onto a classifier/regressor which makes the detection prediction (coordinates of the bounding boxes, the class label, etc.).

In YOLO, the prediction is done by using a convolutional layer which uses 1 × 1 convolutions.

Now, the first thing to notice is that our output is a feature map. Since we have used 1 × 1 convolutions, the size of the prediction map is exactly the size of the feature map before it. In YOLO v3 (and its descendants), the way you interpret this prediction map is that each cell can predict a fixed number of bounding boxes.
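
As a sketch, the prediction layer can be written as a single 1 × 1 convolution. B = 3 boxes per cell and C = 80 classes are the YOLO v3 / COCO values; the depth of the incoming feature map (1024 here) is an assumption for illustration:

```python
import torch
import torch.nn as nn

B, C = 3, 80                 # boxes per cell, number of classes
in_channels = 1024           # assumed depth of the incoming feature map

pred_layer = nn.Conv2d(in_channels, B * (5 + C), kernel_size=1)

feat = torch.rand(1, in_channels, 13, 13)
pred_map = pred_layer(feat)
print(pred_map.shape)        # torch.Size([1, 255, 13, 13]) -- same 13 x 13 size
```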

Though the technically correct term to describe a unit in the feature map would be a neuron, calling it a cell makes it more intuitive in our context.

Depth-wise, we have B × (5 + C) entries in the feature map. B represents the number of bounding boxes each cell can predict. According to the paper, each of these B bounding boxes may specialize in detecting a certain kind of object. Each of the bounding boxes has 5 + C attributes, which describe the center coordinates, the dimensions, the objectness score and C class confidences for each bounding box. YOLO v3 predicts 3 bounding boxes for every cell.
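
One way to make this layout explicit is to separate the box and attribute dimensions with a reshape (a sketch continuing the shapes above):

```python
import torch

B, C = 3, 80
pred_map = torch.rand(1, B * (5 + C), 13, 13)    # (batch, depth, H, W)

# Split the depth into (boxes per cell, attributes per box).
boxes = pred_map.view(1, B, 5 + C, 13, 13)

# Attributes of the first box of the cell in row 6, column 6 (the "red" cell):
tx, ty, tw, th, objectness = boxes[0, 0, :5, 6, 6]
class_scores = boxes[0, 0, 5:, 6, 6]             # C = 80 raw class scores
print(class_scores.shape)                        # torch.Size([80])
```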

You expect each cell of the feature map to predict an object through one of its bounding boxes if the center of the object falls in the receptive field of that cell. (The receptive field is the region of the input image visible to the cell. Refer to the link on convolutional neural networks for further clarification.)

This has to do with how YOLO is trained, where only one bounding box is responsible for detecting any given object. First, we must ascertain which of the cells this bounding box belongs to.

To do that, we divide the input image into a grid of dimensions equal to that of the final feature map.

Let us consider an example below, where the input image is 416 × 416, and the stride of the network is 32. As pointed out earlier, the dimensions of the feature map will be 13 × 13. We then divide the input image into 13 × 13 cells.

[Image: the 416 × 416 input image divided into a 13 × 13 grid; the cell containing the center of the dog’s ground truth box is marked red, and the ground truth box is marked yellow.]

Then, the cell (on the input image) containing the center of the ground truth box of an object is chosen to be the one responsible for predicting the object. In the image, it is the cell which is marked red, which contains the center of the ground truth box (marked yellow).

Now, the red cell is the 7th cell in the 7th row on the grid. We now assign the 7th cell in the 7th row on the feature map (corresponding cell on the feature map) as the one responsible for detecting the dog.

Now, this cell can predict three bounding boxes. Which one will be assigned to the dog’s ground truth label? In order to understand that, we must wrap our heads around the concept of anchors.

Note that the cell we’re talking about here is a cell on the prediction feature map. We divide the input image into a grid just to determine which cell of the prediction feature map is responsible for prediction.

2.3 Anchor Boxes

It might make sense to predict the width and the height of the bounding box, but in practice, that leads to unstable gradients during training. Instead, most of the modern object detectors predict log-space transforms, or simply offsets to pre-defined default bounding boxes called anchors.

Then, these transforms are applied to the anchor boxes to obtain the prediction. YOLO v3 has three anchors, which result in prediction of three bounding boxes per cell.

Coming back to our earlier question, the bounding box responsible for detecting the dog will be the one whose anchor has the highest IoU with the ground truth box.

2.4 Making Predictions

The following formulae describe how the network output is transformed to obtain bounding box predictions.

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}

b_x, b_y, b_w, and b_h are the x, y center coordinates, width and height of our prediction. t_x, t_y, t_w, and t_h are what the network outputs. c_x and c_y are the top-left coordinates of the grid cell. p_w and p_h are the anchor dimensions for the box.

2.5 Center Coordinates

Notice we are running our center coordinates prediction through a sigmoid function. This forces the value of the output to be between 0 and 1. Why should this be the case? Bear with me.

Normally, YOLO doesn’t predict the absolute coordinates of the bounding box’s center. It predicts offsets which are:

  • Relative to the top left corner of the grid cell which is predicting the object.
  • Normalised by the dimensions of the cell from the feature map, which is 1.

For example, consider the case of our dog image. If the prediction for the center is (0.4, 0.7), then this means that the center lies at (6.4, 6.7) on the 13 × 13 feature map (since the top-left coordinates of the red cell are (6, 6)).

But wait, what happens if the predicted x, y coordinates are greater than one, say (1.2, 0.7)? This means the center lies at (7.2, 6.7). Notice the center now lies in the cell just right of our red cell, or the 8th cell in the 7th row. This breaks the theory behind YOLO, because if we postulate that the red box is responsible for predicting the dog, the center of the dog must lie in the red cell, and not in the one beside it.

Therefore, to remedy this problem, the output is passed through a sigmoid function, which squashes the output into a range from 0 to 1, effectively keeping the center in the grid cell which is predicting it.

2.6 Dimensions of the Bounding Box

The dimensions of the bounding box are predicted by applying a log-space transform to the output and then multiplying with an anchor.

[Image: how the detector output is transformed to give the final prediction. Image credits: http://christopher5106.github.io/]

The resultant predictions, b_w and b_h, are normalised by the height and width of the image. (Training labels are chosen this way.) So, if the predictions b_w and b_h for the box containing the dog are (0.3, 0.8), then the actual width and height on the 13 × 13 feature map are (13 × 0.3, 13 × 0.8).
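
Putting sections 2.4 to 2.6 together, here is a minimal sketch of the transform for a single box (the raw outputs and anchor dimensions below are made-up values, not YOLO v3’s actual anchors):

```python
import torch

# Raw network outputs for one box (illustrative values).
tx, ty = torch.tensor(0.1), torch.tensor(0.2)
tw, th = torch.tensor(0.2), torch.tensor(0.4)
cx, cy = 6.0, 6.0      # top-left corner of the red cell on the 13 x 13 grid
pw, ph = 3.6, 2.4      # anchor width/height in grid units (made-up values)

bx = torch.sigmoid(tx) + cx       # sigmoid keeps the center inside the cell
by = torch.sigmoid(ty) + cy
bw = pw * torch.exp(tw)           # log-space transform applied to the anchor
bh = ph * torch.exp(th)

print(bx.item(), by.item(), bw.item(), bh.item())
# ≈ 6.525 6.550 4.397 3.580
```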

2.7 Objectness Score

The objectness score represents the probability that an object is contained inside a bounding box. It should be nearly 1 for the red and the neighboring grid cells, whereas almost 0 for, say, the grid cells at the corners.

The objectness score is also passed through a sigmoid, as it is to be interpreted as a probability.

2.8 Class Confidences

Class confidences represent the probabilities of the detected object belonging to a particular class (Dog, cat, banana, car etc). Before v3, YOLO used to softmax the class scores.

However, that design choice has been dropped in v3, and the authors have opted for using sigmoid instead. The reason is that softmaxing class scores assumes that the classes are mutually exclusive. In simple words, if an object belongs to one class, then it’s guaranteed it cannot belong to another class. This is true for the COCO dataset on which we will base our detector.

However, this assumption may not hold when we have classes like Women and Person. This is the reason why the authors have steered clear of using a softmax activation.
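
The difference is easy to see on made-up scores: softmax forces the classes to compete for probability mass, while independent sigmoids let overlapping labels such as Person and Woman both score high:

```python
import torch

scores = torch.tensor([2.0, 1.8, -3.0])   # raw scores: person, woman, banana

# Softmax: the outputs sum to 1, so "person" and "woman" must compete.
print(torch.softmax(scores, dim=0))   # ≈ tensor([0.5478, 0.4485, 0.0037])

# Sigmoid: each class is an independent yes/no decision.
print(torch.sigmoid(scores))          # ≈ tensor([0.8808, 0.8581, 0.0474])
```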

2.9 Prediction across different scales

YOLO v3 makes predictions across 3 different scales. The detection layer is used to make detections at feature maps of three different sizes, having strides 32, 16, 8 respectively. This means, with an input of 416 × 416, we make detections on scales 13 × 13, 26 × 26 and 52 × 52.

The network downsamples the input image until the first detection layer, where a detection is made using the feature maps of a layer with stride 32. Further, layers are upsampled by a factor of 2 and concatenated with feature maps of previous layers having identical feature map sizes. Another detection is now made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer of stride 8.
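
The upsample-and-concatenate step can be sketched as follows (the channel counts are illustrative; the concatenation is along the depth dimension):

```python
import torch
import torch.nn as nn

deep = torch.rand(1, 256, 13, 13)      # feature map near the stride-32 detection
earlier = torch.rand(1, 512, 26, 26)   # earlier layer with a 26 x 26 feature map

up = nn.Upsample(scale_factor=2, mode="nearest")(deep)  # -> (1, 256, 26, 26)
merged = torch.cat([up, earlier], dim=1)                # concat along channels
print(merged.shape)                                     # torch.Size([1, 768, 26, 26])
```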

At each scale, each cell predicts 3 bounding boxes using 3 anchors, making the total number of anchors used 9. (The anchors are different for different scales)

[Image: the three detection scales, producing 13 × 13, 26 × 26 and 52 × 52 prediction maps.]

The authors report that this helps YOLO v3 get better at detecting small objects, a frequent complaint with the earlier versions of YOLO. Upsampling can help the network learn fine-grained features which are instrumental for detecting small objects.

2.10 Output Processing

For an image of size 416 × 416, YOLO predicts ((52 × 52) + (26 × 26) + (13 × 13)) × 3 = 10647 bounding boxes. However, in the case of our image, there’s only one object, a dog. How do we reduce the detections from 10647 to 1?
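
A one-line sanity check of that count, given the three grid sizes for a 416 × 416 input:

```python
grids = (13, 26, 52)
print(sum(g * g * 3 for g in grids))  # 10647 bounding boxes in total
```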

Thresholding by Object Confidence
First, we filter boxes based on their objectness score. Generally, boxes having scores below a threshold are ignored.
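
In tensor form this is a simple boolean mask. A sketch, assuming each prediction row holds [x, y, w, h, objectness, 80 class scores]:

```python
import torch

preds = torch.rand(10647, 85)           # 10647 boxes x (5 + 80) attributes
conf_threshold = 0.5

keep = preds[:, 4] > conf_threshold     # column 4 holds the objectness score
filtered = preds[keep]
print(filtered.shape)                   # far fewer than 10647 rows remain
```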

Non-maximum Suppression
NMS intends to cure the problem of multiple detections of the same object. For example, all 3 bounding boxes of the red grid cell may detect a box, or the adjacent cells may detect the same object.
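
We implement NMS ourselves in Part 4, but as a sketch, torchvision also ships one: torchvision.ops.nms takes corner-format [x1, y1, x2, y2] boxes and scores (the boxes below are made up):

```python
import torch
from torchvision.ops import nms

# Three detections: two overlapping boxes on the dog plus one elsewhere.
boxes = torch.tensor([[100., 100., 210., 260.],
                      [102.,  98., 215., 255.],
                      [300., 300., 360., 400.]])
scores = torch.tensor([0.90, 0.75, 0.80])

keep = nms(boxes, scores, iou_threshold=0.5)  # indices of surviving boxes
print(keep)  # tensor([0, 2]) -- the duplicate dog detection is suppressed
```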

[Image: non-maximum suppression collapsing multiple detections of the dog into a single box.]

If you don’t know about NMS, I’ve provided a link to a website explaining the same.

3. Our Implementation

YOLO can only detect objects belonging to the classes present in the dataset used to train the network. We will be using the official weight file for our detector. These weights have been obtained by training the network on the COCO dataset, and therefore we can detect 80 object categories.

That’s it for the first part. This post explains enough about the YOLO algorithm to enable you to implement the detector. However, if you want to dig deep into how YOLO works, how it’s trained and how it performs compared to other detectors, you can read the original papers, the links of which I’ve provided below.

That’s it for this part. In the next part, we implement various layers required to put together the detector.
https://blog.paperspace.com/how-to-implement-a-yolo-v3-object-detector-from-scratch-in-pytorch-part-2/

4. Further Reading

YOLO V1: You Only Look Once: Unified, Real-Time Object Detection
YOLO V2: YOLO9000: Better, Faster, Stronger
YOLO V3: An Incremental Improvement
Convolutional Neural Networks
http://cs231n.github.io/convolutional-networks/
Bounding Box Regression (Appendix C)
Rich feature hierarchies for accurate object detection and semantic segmentation
IoU
https://www.youtube.com/watch?v=DNEm4fJ-rto
Non maximum suppression
https://www.youtube.com/watch?v=A46HZGR5fMw
PyTorch Official Tutorial
https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

Ayoosh Kathuria is currently an intern at the Defense Research and Development Organization, India, where he is working on improving object detection in grainy videos. When he’s not working, he’s either sleeping or playing Pink Floyd on his guitar. You can connect with him on LinkedIn or look at more of what he does at GitHub.
https://github.com/ayooshkathuria

Image Credits: Karol Majek. Check out his YOLO v3 real time detection video here
https://www.youtube.com/watch?v=8jfscFuP9k

References

https://blog.paperspace.com/
https://blog.paperspace.com/tag/series-yolo/
https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/
http://christopher5106.github.io/
http://christopher5106.github.io/object/detectors/2017/08/10/bounding-box-object-detectors-understanding-yolo.html
https://www.jiqizhixin.com
https://github.com/ayooshkathuria
