An Offbeat Take on YOLOv3 (the "Star-Absorbing Technique" of the detection world...)

LouisD5Luka

Tags: computer vision, CV, deep learning

Article link: https://blog.csdn.net/LouisD5Luka/article/details/103588626

Learning YOLO the YOLO way (never mind...)

2019.12.17 to 2019.12.20
This post assumes you already know the basic concepts of deep learning.

YOLO

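YOLO stands for "You only look once" (it is the Star-Absorbing Technique after all, so of course one look is all you get 😂), not the "You only live once" you sometimes see online (ha, who doesn't live only once? For object detection, "look once" seems the better fit. 😂).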

What follows covers the details of version 3, i.e. YOLOv3. (Off on a business trip now (abroad, of all places, heh...) ... several days pass here.)

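OK, a few days of study later, I'm back. Anyway, first a few errors I ran into along the way:

The name `write` is used more than once. Define names before using them and watch out for a variable overwriting a function of the same name; otherwise you get a "'int' object is not callable" error.
For path separators, mind the difference between '\' and '//' (a backslash starts an escape sequence in an ordinary Python string, so write '\\' or use a raw string).
The `cv2.resize` call in the function should be replaced by the `letterbox_image` function; otherwise the predicted boxes are at the very least inaccurate, and some sit too low or come out too large.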
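The third point is easier to see with numbers. Here is a minimal sketch, assuming a 416 x 416 network input and the same scale-then-pad rule as the `letterbox_image` helper listed later in this post. The function name `letterbox_geometry` and the image size are made up for illustration.

```python
def letterbox_geometry(img_w, img_h, net_w, net_h):
    """Return the resized size and the padding of an aspect-preserving resize."""
    # One scale factor for both axes, so the image is never stretched.
    scale = min(net_w / img_w, net_h / img_h)
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    pad_x = (net_w - new_w) // 2  # gray border on the left
    pad_y = (net_h - new_h) // 2  # gray border on the top
    return new_w, new_h, pad_x, pad_y


# A hypothetical 832 x 624 photo fed to a 416 x 416 network.
geometry = letterbox_geometry(832, 624, 416, 416)
print(geometry)  # (416, 312, 0, 52)
```

A plain `cv2.resize` to 416 x 416 would instead stretch the 4:3 photo into a square, so the boxes the network predicts no longer line up with the original image unless that distortion is undone separately.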

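That is about it for errors. Now let's take YOLOv3 apart piece by piece.
First a quick introduction to v1 and v2. (Thanks to this Zhihu author for the analysis of v1 and v2; the express route is their post "YOLO series: the detector all the young people are using".)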

YOLOv1 (CVPR 2016)

The giant's left shoulder:
You Only Look Once: Unified, Real-Time Object Detection

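The input image is divided into an S x S grid (S = 7 in the paper).
The grid cell that contains the centre of an object is responsible for predicting that object.
Each cell predicts several bounding boxes and a confidence for each (the product of whether an object is present, {0, 1}, and the IoU between the predicted box and the ground-truth box).
Each bounding box has 5 predicted values: the centre coordinates x, y relative to its grid cell, the width and height w, h, and the confidence of the predicted box.
Each cell also predicts the conditional probability of every class (given that the cell contains an object, the probability that it belongs to each class).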
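To make the bookkeeping concrete, here is a small sketch of the output size and the confidence target. It assumes the PASCAL VOC setting of the v1 paper (S = 7 cells per side, B = 2 boxes per cell, C = 20 classes); the helper names are invented for this example.

```python
def yolo_v1_output_shape(S, B, C):
    """Each of the S x S cells predicts B boxes (5 numbers each) plus C class probabilities."""
    return (S, S, B * 5 + C)


def box_confidence(has_object, iou):
    """Confidence target: Pr(Object) * IoU(predicted box, ground-truth box)."""
    return (1.0 if has_object else 0.0) * iou


shape = yolo_v1_output_shape(S=7, B=2, C=20)
print(shape)  # (7, 7, 30)
print(box_confidence(True, 0.8), box_confidence(False, 0.8))  # 0.8 0.0
```

So the whole detector output for one image is a single 7 x 7 x 30 tensor, which is what "only look once" amounts to in practice.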
The network architecture is as follows (figure omitted):

For activations, every layer uses Leaky ReLU except the last one, which uses a linear activation:
$\phi(x)= \begin{cases} x,&\text{if } x > 0\\ 0.1x,& \text{otherwise} \end{cases}$

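The loss is SSE (sum-squared error, reportedly because it is easy to optimise) and comprises a coordinate error, an IoU (confidence) error, and a classification error. Only a minority of the generated boxes contain an object, so the many object-free boxes, whose confidence targets are zero, would otherwise dominate the total error. The two groups are therefore weighted differently: the coordinate error of boxes that contain an object gets $\lambda_{coord}=5$, and the confidence error of boxes without an object gets $\lambda_{noobj}= 0.5$.

Putting it all together, the loss function is: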
$\begin{aligned} \mathbb{L} =& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{I}_{ij}^{obj}[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}]\\ &+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{I}_{ij}^{obj}[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}]\\ &+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{I}_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2}\\ &+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{I}_{ij}^{noobj}[(C_{i}-\hat{C}_{i})^{2}]\\ &+\sum_{i=0}^{S^{2}}\mathbb{I}_{i}^{obj}\sum_{c\in classes}(p_{i}(c)-\hat{p}_{i}(c))^{2} \end{aligned}$
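(All typed by hand, like hand-pounded beef balls. Feels great!) Here $\mathbb{I}_{ij}^{obj}\in\{0, 1\}$ indicates whether the $j$-th box of the $i$-th cell is responsible for an object, and $\mathbb{I}_{i}^{obj}$ indicates whether cell $i$ contains an object.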
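The five terms are easier to check in code than in LaTeX. Below is a minimal pure-Python sketch of the loss above, not the Darknet implementation. The data layout (one dict per grid cell, a `"responsible"` index in the ground truth) is an assumption made up for this example.

```python
import math

LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5


def yolo_v1_loss(pred, truth):
    """Sum-squared-error loss over a list of grid cells.

    Every cell is a dict with "boxes" (B tuples of x, y, w, h, C) and
    "classes" (class probabilities). Ground-truth cells also carry
    "responsible": the index of the responsible box, or None if the cell
    contains no object.
    """
    loss = 0.0
    for p_cell, t_cell in zip(pred, truth):
        j_obj = t_cell["responsible"]
        for j, (p_box, t_box) in enumerate(zip(p_cell["boxes"], t_cell["boxes"])):
            px, py, pw, ph, pc = p_box
            tx, ty, tw, th, tc = t_box
            if j == j_obj:
                # Terms 1 and 2: centre and (square-rooted) size of the responsible box
                loss += LAMBDA_COORD * ((px - tx) ** 2 + (py - ty) ** 2)
                loss += LAMBDA_COORD * ((math.sqrt(pw) - math.sqrt(tw)) ** 2
                                        + (math.sqrt(ph) - math.sqrt(th)) ** 2)
                # Term 3: confidence of the responsible box
                loss += (pc - tc) ** 2
            else:
                # Term 4: confidence of boxes without an object, down-weighted
                loss += LAMBDA_NOOBJ * (pc - tc) ** 2
        if j_obj is not None:
            # Term 5: class probabilities, only for cells that contain an object
            loss += sum((p - t) ** 2 for p, t in zip(p_cell["classes"], t_cell["classes"]))
    return loss


# One cell, B = 2, two classes; box 0 is responsible and predicted perfectly.
truth = [{"boxes": [(0.5, 0.5, 0.25, 0.25, 1.0), (0.0, 0.0, 0.0, 0.0, 0.0)],
          "classes": [1.0, 0.0], "responsible": 0}]
pred = [{"boxes": [(0.5, 0.5, 0.25, 0.25, 1.0), (0.1, 0.1, 0.1, 0.1, 0.2)],
         "classes": [1.0, 0.0]}]
loss = yolo_v1_loss(pred, truth)
print(round(loss, 6))  # 0.02
```

The only contribution here is the stray confidence 0.2 of the second box, weighted by 0.5. Its coordinates are ignored entirely, which is exactly what the indicator functions in the formula say.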

Pros and Cons

Pros:

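Speed beats everything (it was simply faster than everyone else at the time...). It is the founder of the one-stage school. Before it, detectors belonged to the two-stage school of region proposal + classification network (the R-CNN family; at around the same time as v1, Faster R-CNN burst onto the scene with its Region Proposal Network, which cleared the blocked meridians of training and made it end-to-end). Running in separate stages is still a good deal slower.
Region proposals and sliding windows only see local parts of the image, whereas YOLO takes in the context of the whole image, so it avoids background errors much better.
It learns features that generalise well.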

Cons:

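A speed/accuracy trade-off: accuracy was below that of the other SOTA models of the time.
Prone to localisation errors.
Poor accuracy on small objects. (Later versions tackle this with multi-layer feature fusion.)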

YOLOv2 (CVPR 2017)

The giant's right shoulder:
YOLO9000: Better, Faster, Stronger

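YOLO9000, the paper that introduces YOLOv2, lives up to its title: better, faster, stronger!

Highlights:

Star-Absorbing Technique, first form: anchors (special thanks to Faster R-CNN), with K-Means clustering of the box dimensions so that training starts from better prior box sizes. v1 predicted boxes through fully connected layers, which loses a lot of spatial information and hurts localisation, so anchors in the style of the RPN were brought in to build up inner strength. (For the RPN and anchor manual, see: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.)
A fully convolutional architecture replaces the fully connected layers.
Training introduces the WordTree structure, which unifies the detection and classification frameworks, together with joint training on the ImageNet classification dataset and the COCO detection dataset at the same time. A tree is built from the category relationships between the two datasets; savour it below (cue the meme: why are you wearing Pinru's clothes???):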

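Locate the node of the class in question; its predicted probability is the product of the conditional probabilities of the nodes along the path from the root.

During training, a detection sample is scored with the full YOLOv2 loss, whereas a classification sample contributes only the classification error. At prediction time, YOLOv2 outputs a confidence together with the bounding box position and a tree of probabilities. The most probable path is followed down this tree until a threshold is reached, and the node reached at that point is the predicted class. (Copied, copied... (that's all there is to it!! No analogical reasoning needed here, thx).)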
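The tree walk is simple enough to try by hand. The sketch below uses a made-up three-level hierarchy and made-up conditional probabilities; only the two rules (multiply along the path, descend along the best child until a threshold) come from the description above. The `root_prob` argument is an assumption standing in for the detector's objectness.

```python
# Hypothetical toy hierarchy: child -> parent. The root is "physical object".
PARENT = {
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
}

# Hypothetical conditional probabilities Pr(node | parent) from the network.
COND_PROB = {
    "animal": 0.9,
    "dog": 0.8,
    "terrier": 0.5,
}


def absolute_probability(node, root_prob=1.0):
    """Multiply the conditional probabilities on the path from node up to the root."""
    prob = root_prob
    while node in PARENT:
        prob *= COND_PROB[node]
        node = PARENT[node]
    return prob


def predict_label(threshold, root="physical object"):
    """Walk down from the root along the most probable child and stop
    before the accumulated probability would fall below threshold."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    node, prob = root, 1.0
    while node in children:
        best = max(children[node], key=lambda c: COND_PROB[c])
        if prob * COND_PROB[best] < threshold:
            break
        prob *= COND_PROB[best]
        node = best
    return node, prob


print(round(absolute_probability("terrier"), 2))  # 0.36
print(predict_label(0.5)[0])                      # dog
```

With a threshold of 0.5 the model is confident it sees a dog but not which breed, so it answers "dog". That is the point of the tree: it can fall back to a coarser label instead of guessing a fine one.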

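Batch Normalization layers are added after the convolutional layers, which speeds up convergence and reduces the reliance on other measures against overfitting (such as Dropout); mAP improves by 2%.
v1 pre-trained the classifier on 224 x 224 images and only switched to 448 x 448 for the detection training proper, which is rather "reserved". v2 is more direct and also trains the classifier on 448 x 448 images; the higher resolution raises mAP by 4%.
Star-Absorbing Technique, second form: feature maps from several levels capture features at different resolutions (special thanks to Faster R-CNN and SSD). Adjacent features of the higher-resolution shallow feature map are stacked into different channels (rather than spatial positions), similar to the "shortcut" in ResNet, and then combined through convolution.
Multi-scale inputs during training improve robustness.
Backbone: VGG-16 was the common choice, reaching 90% top-5 accuracy on ImageNet (a prediction counts as correct if the true class is among the five highest predicted probabilities), but it needs 30.69 billion floating-point operations per image. By contrast, a DarkNet-19 backbone reaches 91% top-5 on the same dataset with only 5.58 billion floating-point operations per image, so the speed-up speaks for itself.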

Every coin has two sides

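v2 changed a lot compared with v1 and works much better (judging from the above, anyway), but one problem remains: it has no good answer for overlapping class labels.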

YOLOv3

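Once the Star-Absorbing Technique had stored up enough inner power, the great demon king YOLOv3 descended like a god. (Hahahahahaha!)

First, let's walk through the principle and process by which YOLOv3 detects objects. As the title says, it inherits from v1 (side note: rbg himself was a co-author of v1; you probably know who he is...) and v2.

Details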

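Let the network's coordinate outputs be $t_{x}, t_{y}, t_{w}, t_{h}$; then:
$\begin{aligned} b_{x}&=\sigma(t_{x})+c_{x}\\ b_{y}&=\sigma(t_{y})+c_{y}\\ b_{w}&=p_{w}e^{t_{w}}\\ b_{h}&=p_{h}e^{t_{h}} \end{aligned}$
where $b_{x},b_{y}$ are the centre coordinates of the predicted box and $b_{w},b_{h}$ its width and height; $c_{x},c_{y}$ are the coordinates of the grid cell's top-left corner; and $p_{w},p_{h}$ are the dimensions of the anchor (prior) box. (Many things only seem hard because of their odd names, anchors for instance...)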
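A quick numeric check of these four equations, as a sketch in plain Python. The extra `stride` argument, which converts grid units to pixels, and the chosen cell and anchor are assumptions for this example; the anchor (116 x 90) is one of the nine priors quoted in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h, stride):
    """Apply the four equations above, then scale from grid units to pixels."""
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return tuple(v * stride for v in (b_x, b_y, b_w, b_h))


# Raw outputs of zero put the centre in the middle of cell (6, 6)
# and leave the anchor size unchanged.
anchor_w, anchor_h = 116 / 32, 90 / 32  # anchor in grid units at stride 32
box = decode_box(0.0, 0.0, 0.0, 0.0, 6, 6, anchor_w, anchor_h, stride=32)
print(box)  # (208.0, 208.0, 116.0, 90.0)
```

The sigmoid keeps the centre inside its own cell no matter how large the raw output gets, and the exponential keeps width and height positive. Those two constraints are the whole reason for the odd-looking parameterisation.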
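SSE (sum-squared error) is still used for the coordinates. In the paper's notation, with ground truth $\hat{t}_{*}$ and prediction $t_{*}$, the gradient is the ground truth minus the prediction: $\hat{t}_{*}-t_{*}$.
Logistic regression predicts an objectness score for each bounding box. The target is 1 for the box prior that overlaps a ground-truth object more than any other prior does. A prior that is not the best but still overlaps the ground truth by more than a threshold is ignored (think of it as detecting the same object as the best box while matching the ground truth less closely, so it is left out of consideration). The threshold used is 0.5.
Unlike Faster R-CNN, YOLOv3 assigns only one bounding box prior to each ground-truth object; a prior that is not assigned contributes loss only for objectness, not for coordinates or class predictions.
In addition, the network uses shortcut connections akin to the skip connections in ResNet.
Each box is classified with multi-label classification (independent logistic classifiers instead of a softmax, which brought no improvement). During training, the class predictions use binary cross-entropy loss (`nn.BCELoss()` in PyTorch). A softmax assumes the classes are mutually exclusive, which suits boxes that carry exactly one label, but it is helpless on datasets with many overlapping labels (the paper's example: Woman and Person).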
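The difference shows up immediately on a box that deserves two labels. This is a small plain-Python sketch with made-up logits for three classes; it illustrates the idea only and is not the loss code of any particular implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy over independent per-class predictions."""
    eps = 1e-12  # avoids log(0)
    terms = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
             for p, t in zip(probs, targets)]
    return sum(terms) / len(terms)


# Hypothetical logits for the classes [person, woman, car].
logits = [4.0, 4.0, -4.0]
independent = [sigmoid(z) for z in logits]  # each class judged on its own
exclusive = softmax(logits)                 # classes compete for a total of 1

print([round(p, 3) for p in independent])
print([round(p, 3) for p in exclusive])
print(round(binary_cross_entropy(independent, [1, 1, 0]), 3))
```

With independent sigmoids both "person" and "woman" can be close to 1 at once. Under a softmax the two have to share the probability mass, so neither can exceed about one half even though the network is equally sure of both.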
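Features are extracted from different layers using a concept similar to Feature Pyramid Networks, and boxes are predicted at 3 different scales. The feature map from two layers earlier is upsampled by 2 $\times$ and concatenated with a feature map taken from earlier in the network. This yields more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. A few more convolutional layers then process the concatenated features.
K-Means is again used to cluster the bounding boxes of the dataset into priors, which gives these 9 boxes: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326) (copy, copy...).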
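Three scales times three anchors per scale explains both how the nine priors are shared out and how many boxes the network produces. The arithmetic below is a sketch that assumes a 416 x 416 input, strides of 32, 16 and 8, and the anchor masks of the stock `yolov3.cfg`.

```python
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
STRIDES = [32, 16, 8]                        # coarse to fine
MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]    # anchor indices used at each scale


def boxes_per_scale(input_size, strides=STRIDES, anchors_per_scale=3):
    """Number of predicted boxes at every detection scale."""
    counts = []
    for stride in strides:
        grid = input_size // stride  # grid cells per side at this scale
        counts.append(grid * grid * anchors_per_scale)
    return counts


counts = boxes_per_scale(416)
print(counts, sum(counts))  # [507, 2028, 8112] 10647

for stride, mask in zip(STRIDES, MASKS):
    print(stride, [ANCHORS[i] for i in mask])
```

The coarsest grid (stride 32) gets the three largest anchors and the finest grid (stride 8) the three smallest, which is how the small-object weakness of the earlier versions is addressed.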

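Now the network architecture, DarkNet-53:

The paper states that DarkNet-53 is more powerful than DarkNet-19 and more efficient than ResNet-101 and ResNet-152. Here is the table, pics or it didn't happen:

The paper also suggests that the ResNets simply have too many layers and are therefore less efficient than their model. The BFLOP/s figure (billions of floating-point operations per second) likewise shows that the architecture makes better use of the GPU. (That's it. Looked at this way there is indeed not much new: a different backbone plus multi-level feature fusion borrowed from others. It does not feel as striking as v2; if you want to read one of the original papers, go for v2.)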

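Next, a more detailed comparison:

...
(The author openly admits to "stealing" the tables and cheekily grumbles that they take forever to make. In his words: I'm seriously just stealing all these tables from [9] they take soooo long to make from scratch.)

If you are feeling down, try reading the YOLOv3 paper; it might just cheer you up.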

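Judging by accuracy alone, there is still a clear gap to RetinaNet, for large, medium and small objects alike. The comparable model is SSD513: at an IoU threshold of .5 YOLOv3 is far better than SSD, and it is also much better on small objects, but it falls somewhat short on large objects.

So if it is not as good as RetinaNet, why choose it? See the figure:

...
(Honestly, I suspect he is just having fun: he plots his own model in the second quadrant, off the left edge of the axes. Very much Usain Bolt at the 2008 Beijing Olympics 😂, looking back over his shoulder before the finish line.)

The answer is obvious: it is much, much faster.

That wraps up the paper itself. Next comes the code. Talk is cheap (enough chatter...):
First, the library of helper functions (imported later as `Customize`):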

from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np
import cv2 


# Helper that returns the distinct classes present in a given image
def unique(tensor):
    tensor_np = tensor.cpu().numpy()
    # Remove duplicates
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)
    
    tensor_res = tensor.new(unique_tensor.shape)
    # copy_() writes the values into tensor_res, which keeps the dtype and device of the input tensor.
    tensor_res.copy_(unique_tensor)
    
    return tensor_res

# IoU between boxes given as (x1, y1, x2, y2) corner coordinates.
def bbox_iou(box1, box2):
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:,0], box1[:,1], box1[:,2], box1[:,3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:,0], box2[:,1], box2[:,2], box2[:,3]
    
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    # +1 because the coordinates are treated as inclusive pixel indices: a box from x1 to x2 covers x2 - x1 + 1 pixels.
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0)*torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)
    
    b1_area = (b1_x2 - b1_x1 + 1)*(b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1)*(b2_y2 - b2_y1 + 1)
    
    iou = inter_area / (b1_area + b2_area - inter_area)
    
    return iou

def predict_transform(prediction, inp_dim, anchors, num_classes, CUDA=True):
    # prediction is a tensor of shape [batch_size, channels, height, width]
    batch_size = prediction.size(0)
    # Number of input-image pixels covered by one grid cell of the output
    stride = inp_dim // prediction.size(2)
    # Dividing back gives the number of grid cells per side, i.e. the feature-map size
    grid_size = inp_dim // stride
    # Four coordinates, one objectness score, and one score per class
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)
    
    # Tensor.view(*shape) reshapes the tensor to shape
    prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
    prediction = prediction.transpose(1, 2).contiguous()
    prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs)
    # Express the anchors in grid-cell units
    anchors = [(a[0]/stride, a[1]/stride) for a in anchors]
    
    # Sigmoid activation
    # Centre x
    prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
    # Centre y
    prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
    # Objectness score
    prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])
    
    # Add the centre offsets
    grid = np.arange(grid_size)
    a, b = np.meshgrid(grid, grid)
    # One x offset and one y offset per grid cell.
    # .view(-1, 1) flattens row by row into a single column
    x_offset = torch.FloatTensor(a).view(-1, 1)
    y_offset = torch.FloatTensor(b).view(-1, 1)
    
    if CUDA:
        x_offset = x_offset.cuda()
        y_offset = y_offset.cuda()
    # A flurry of operations that looks baffling at first glance; step by step:
    # torch.cat((a, b), 1) joins along dim 1 (columns); dim 0 would join along rows.
    # .repeat(1, num_anchors) tiles each (x, y) pair once per anchor, since every anchor of a cell gets the same offset.
    # .view(-1, 2) then gives one (x, y) row per prediction.
    # .unsqueeze(0) adds a leading batch dimension so the shapes line up.
    x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1, num_anchors).view(-1, 2).unsqueeze(0)
    # Add the offsets to the centre x, y
    prediction[:,:,:2] += x_y_offset
    
    anchors = torch.FloatTensor(anchors)
    
    if CUDA:
        anchors = anchors.cuda()
    # Width and height: exponentiate the raw outputs and scale by the anchors
    anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0)
    prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors

    # Independent sigmoid for every class score
    prediction[:,:,5:5 + num_classes] = torch.sigmoid((prediction[:,:,5:5 + num_classes]))
    # Scale the boxes from grid units back to input-image pixels
    prediction[:,:,:4] *= stride
                   
    return prediction

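# Predicting is not the end of it: so far there is only a huge pile of boxes, and they still need to be picked over.
# The picking is done with the objectness score and non-maximum suppression (NMS).
# That brings two thresholds: the objectness threshold and the IoU (Intersection over Union, the overlap ratio of two boxes) threshold used inside NMS.
def write_results(prediction, confidence, num_classes, nms_conf=0.4):
    # Keep only bounding boxes whose objectness score exceeds the confidence threshold
    conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
    prediction = prediction*conf_mask

write_results first filters by object confidence. Without this step you get an image like the following (i.e. the raw DarkNet output; only the boxes are drawn, the class predictions are not visualised (truth be told, that was too hard for me...)):

(Interesting... boxes everywhere. By the way, 10647 candidate boxes are generated here: (13 × 13 + 26 × 26 + 52 × 52) × 3 for a 416 × 416 input.)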

    
    box_corner = prediction.new(prediction.shape)
    # Convert (centre x, centre y, width, height) into corner coordinates
    box_corner[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2)# top-left
    box_corner[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2)
    box_corner[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2)# bottom-right
    box_corner[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2)
    prediction[:,:,:4] = box_corner[:,:,:4]# overwrite the original coordinates
    
    batch_size = prediction.size(0)
    # Flag: has output been initialised yet?
    write = False
    
    for ind in range(batch_size):
        # Pick out the predictions of one image
        image_pred = prediction[ind]
        # Keep only the best class score: torch.max over dim=1 returns each row's maximum and its index.
        # Mind the names: max_conf holds the score, max_conf_score holds the class index.
        max_conf, max_conf_score = torch.max(image_pred[:,5:5 + num_classes], 1)
        max_conf = max_conf.float().unsqueeze(1)
        max_conf_score = max_conf_score.float().unsqueeze(1)
        seq = (image_pred[:,:5], max_conf, max_conf_score)
        # Each row is now [x_left_top, y_left_top, x_right_bottom, y_right_bottom, object confidence, class score, class index]
        image_pred = torch.cat(seq, 1)
        # Reminder: the network output has shape [batch size, number of bounding boxes predicted per image, bounding box attributes]
        # torch.nonzero() returns the indices of the non-zero entries.
        non_zero_ind = (torch.nonzero(image_pred[:,4]))
        try:
            # Keep the boxes with a non-zero objectness score (-1 lets view infer that dimension)
            image_pred_= image_pred[non_zero_ind.squeeze(),:].view(-1, 7)
        except Exception:
            continue
            
        if image_pred_.shape[0] == 0:
            continue
        
        # Distinct classes detected in this image.
        img_classes = unique(image_pred_[:,-1])# the last column is the class index
        
        # img_classes is a 1-D tensor of the distinct class indices
        for cls in img_classes:
            # Detections of one class; rows of other classes become all-zero rows.
            cls_mask = image_pred_*(image_pred_[:,-1] == cls).float().unsqueeze(1)
            class_mask_ind = torch.nonzero(cls_mask[:,-2]).squeeze()
            # All of that just to pull out the rows that belong to this class. OK, fine.
            image_pred_class = image_pred_[class_mask_ind].view(-1, 7)
            # Sort by objectness score, highest first; [1] takes the index sequence.
            conf_sort_index = torch.sort(image_pred_class[:,4], descending = True)[1]
            # Reorder the detections of this class from highest to lowest objectness score.
            image_pred_class = image_pred_class[conf_sort_index]
            # Number of detections of this class, used as the loop bound in NMS.
            idx = image_pred_class.size(0)

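Next comes NMS. Before applying it, here is what the image looks like without NMS:
...
Clearly every ground-truth object has several very similar predicted boxes, and NMS is how we pick the best one among them. Talk is cheap (one more time lol...):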
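Before the PyTorch version continues, the idea itself fits in a few lines of plain Python. This is a minimal sketch of greedy NMS for a single class, not the code of this post: the function names and boxes are made up, and the IoU here uses continuous coordinates without the "+1" pixel convention of `bbox_iou`.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.4):
    """detections: list of (box, score) for one class. Returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest score still in play
        kept.append(best)
        # Drop everything that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


detections = [((10, 10, 110, 110), 0.9),
              ((12, 12, 112, 112), 0.8),    # near-duplicate of the first box
              ((200, 200, 300, 300), 0.7)]  # a different object
kept = nms(detections)
print(kept)
```

Only the first and the third detection survive. The tensor code that follows does the same thing, except that it zeroes out the suppressed rows and filters them afterwards instead of rebuilding a list.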

            # Very well, time for the real trick: NMS
            # Much like bubble sort, boxes are compared pair by pair to filter out the best bounding boxes.
            for i in range(idx):
                try:
                    # IoU between the i-th bounding box and all boxes after it
                    ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:])
                except ValueError:
                    break
                    
                except IndexError:
                    break
                    
                # Zero out predictions whose IoU reaches the threshold, i.e. drop near-duplicate boxes.
                iou_mask = (ious < nms_conf).float().unsqueeze(1)
                image_pred_class[i+1:] *= iou_mask
                
                # Drop the zeroed rows
                non_zero_ind = torch.nonzero(image_pred_class[:,4]).squeeze()
                # Only bounding boxes with an IoU below the threshold remain
                image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)
            # Prepend the index of the image within the batch to every detection.
            batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)

            seq = batch_ind, image_pred_class

            # write tells whether the output tensor has been initialised already.
            if not write:
                # Concatenate along columns.
                output = torch.cat(seq, 1)
                write = True
            else:
                out = torch.cat(seq, 1)
                output = torch.cat((output, out))
            
    # output is never assigned when nothing was detected; return 0 in that case.
    try:
        return output
    except NameError:
        return 0
    
# Resize the image while keeping its aspect ratio, then pad the rest of the canvas with gray (128) so that objects are not distorted.
def letterbox_image(img, inp_dim):
    img_w, img_h = img.shape[1], img.shape[0]
    w, h = inp_dim
    new_w = int(img_w * min(w/img_w, h/img_h))
    new_h = int(img_h * min(w/img_w, h/img_h))
    resized_image = cv2.resize(img, (new_w, new_h), interpolation = cv2.INTER_CUBIC)
    
    canvas = np.full((inp_dim[1], inp_dim[0], 3), 128)
    
    canvas[(h-new_h)//2:(h-new_h)//2 + new_h, (w-new_w)//2:(w-new_w)//2 + new_w, :] = resized_image

    return canvas

# Prepare an input image for the network
def prep_image(img, inp_dim):
    img = letterbox_image(img, (inp_dim, inp_dim))
    img = img[:,:,::-1].transpose((2, 0, 1)).copy()
    img = torch.from_numpy(img).float().div(255.0).unsqueeze(0)
    return img

def load_classes(namesfile):
    # One class name per line; the file ends with a newline, hence the [:-1]
    with open(namesfile, "r") as fp:
        names = fp.read().split("\n")[:-1]
    return names
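One line in this helper library deserves a second look: the `x_y_offset` line in `predict_transform` is dense enough that its own comment calls it baffling. The pure-Python sketch below reproduces only the ordering that the `meshgrid` / `cat` / `repeat` / `view` chain produces, assuming the row layout created by the reshapes at the top of that function. `grid_offsets` is a name invented for this example.

```python
def grid_offsets(grid_size, num_anchors):
    """(x, y) offset for every prediction row: cells are visited row by row,
    and all anchors of one cell sit next to each other."""
    offsets = []
    for y in range(grid_size):
        for x in range(grid_size):
            offsets.extend([(x, y)] * num_anchors)
    return offsets


offsets = grid_offsets(grid_size=2, num_anchors=3)
print(offsets[:6])   # [(0, 0), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0)]
print(len(offsets))  # 12
```

Every anchor of a cell gets the same offset, namely the coordinates of the cell's top-left corner. This is the $c_{x}, c_{y}$ of the decoding equations, added to the sigmoid outputs in one vectorised step.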

Building DarkNet-53:

from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np
from Customize import * 



def get_test_input():
    img_path = 'C:/Users/LauCh/Desktop/CV/YOLO/dog-cycle-car.png'
    img = cv2.imread(img_path)
#     cv2.imshow('OriginalPicture', img)
    img = cv2.resize(img, (416, 416))# resize to the network input size
    img_ = img[:,:,::-1].transpose((2, 0, 1))# BGR -> RGB, H x W x C -> C x H x W
    img_ = img_[np.newaxis,:,:,:]/255.0# add the batch dimension at the front and normalise to [0, 1]
    img_ = torch.from_numpy(img_).float()
    img_ = Variable(img_)
               
    return img_

# Parse the cfg file into a list of dicts, one per block, mapping parameter names to values
def parse_cfg(cfgfile):
    with open(cfgfile, 'r') as file:
        # Split on newlines
        lines = file.read().split('\n')
    # Strip surrounding whitespace first, so whitespace-only lines are dropped below
    lines = [x.strip() for x in lines]
    # Drop empty lines
    lines = [x for x in lines if len(x) > 0]
    # Drop comments
    lines = [x for x in lines if x[0] != "#"]

    block = {}
    blocks = []

    for line in lines:
        if line[0] == "[":# a new block starts
            if len(block) != 0:# the dict still holds the previous block
                blocks.append(block)# store the block that has just been read
                block = {}# reset
            block["type"] = line[1:-1].rstrip()# block type (stripped for the same reason as the key/value pairs)
        else:
            key, value = line.split("=")
            # The spacing in the source file is inconsistent, so strip with .rstrip() and .lstrip()
            block[key.rstrip()] = value.lstrip()
    blocks.append(block)# store the last block
    
    return blocks

class EmptyLayer(nn.Module):
    def __init__(self):
        super(EmptyLayer, self).__init__()

class DetectionLayer(nn.Module):
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors

def create_modules(blocks):
    net_info = blocks[0]
    module_list = nn.ModuleList()
    prev_filters = 3 # RGB
    output_filters = [] # number of output channels of every layer

    for index, x in enumerate(blocks[1:]):
        module = nn.Sequential()
        if (x["type"] == "convolutional"):
            activation = x["activation"]
            # The batch_normalize key tells whether the block has a batch-norm layer
            try:
                batch_normalize = int(x["batch_normalize"])
                bias = False
            except:
                batch_normalize = 0
                bias = True

            filters = int(x["filters"])
            padding = int(x["pad"])
            stride = int(x["stride"])
            kernel_size = int(x["size"])

            # The pad flag decides whether to zero-pad
            if padding:
                pad = (kernel_size - 1) // 2
            else:
                pad = 0

            # Convolutional layer
            conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias=bias)
            # nn.Module.add_module(name, module)
            module.add_module("conv_{0}".format(index), conv)
            # Batch-norm layer
            if batch_normalize:
                bn = nn.BatchNorm2d(filters)
                module.add_module("bn_{0}".format(index), bn)
            # Non-linear activation
            if activation == 'leaky':
                activ = nn.LeakyReLU(0.1, True)
                module.add_module("activn_{0}".format(index), activ)
        # Upsampling layer
        elif (x['type'] == 'upsample'):
            stride = int(x['stride'])
            # Upsample by a factor of 2 using bilinear interpolation
            upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
            module.add_module("upsample_{0}".format(index), upsample)
             
        elif (x["type"] == "route"):
            x["layers"] = x["layers"].split(',')
            # start and end are layer indices; positive (absolute) values are converted below to offsets relative to this layer
            start = int(x["layers"][0])
            try:
                end = int(x["layers"][1])
            except:
                end = 0
            if start > 0:
                start = start - index
            if end > 0:
                end = end - index
            route = EmptyLayer()
            module.add_module("route_{0}".format(index), route)
            if end < 0:
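                # Two layers listed: their feature maps are concatenated along the channel dimension, so the channel counts add up. One layer listed: its output is passed through.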
                filters = output_filters[index + start] + output_filters[index + end]
            else:
                filters = output_filters[index + start]


        elif (x["type"] == "shortcut"):
            shortcut = EmptyLayer()
            module.add_module("shortcut_{}".format(index), shortcut)
             
        elif (x["type"] == "yolo"):
            # mask selects which of the anchors this detection layer uses.
            mask = x["mask"].split(",")
            mask = [int(x) for x in mask]

            anchors = x["anchors"].split(",")
            anchors = [int(a) for a in anchors]
            # Pair the values up into anchors, e.g. (10, 13)
            anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in mask]

            detection = DetectionLayer(anchors)
            module.add_module("Detection_{}".format(index), detection)
             
        module_list.append(module)
        prev_filters = filters
        output_filters.append(filters)

    return (net_info, module_list)


# Defining Darknet
class Darknet(nn.Module):
    def __init__(self, cfgfile):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)
        
    def forward(self, x, CUDA):
#         device = "cuda:0" if CUDA else "cpu"
        modules = self.blocks[1:]
        output = {}
        write = 0
        for i, module in enumerate(modules):
            module_type = (module["type"])
            if (module_type == "convolutional" or module_type == "upsample"):
                x = self.module_list[i](x)
            elif (module_type == "route"):
                layers = module["layers"]
                layers = [int(a) for a in layers]
                
                if (layers[0] > 0):
                    layers[0] = layers[0] - i
                    
                if len(layers) == 1:
                    x = output[i + (layers[0])]
                    
                else:
                    if (layers[1] > 0):
                        layers[1] = layers[1] - i
                        
                    map1 = output[i + layers[0]]
                    map2 = output[i + layers[1]]
                    # torch.cat(tensors, dim=0, out=None): dim=1 concatenates along the channel dimension.
                    x = torch.cat((map1, map2), 1)
                    
            elif module_type == "shortcut":
                from_ = int(module["from"])
                # A shortcut layer outputs the sum of the previous layer's output and the output of the layer from_ positions away
                x = output[i-1] + output[i+from_]

            elif module_type == "yolo":
                # Anchors used by this detection layer
                anchors = self.module_list[i][0].anchors
                # Input dimension
                inp_dim = int(self.net_info["height"])
                # Number of classes
                num_classes = int(module["classes"])
                # Transform the raw feature map into box predictions
                x = x.data
                x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
                if not write:
                    detections = x
                    write = 1
                else:
                    detections = torch.cat((detections, x), 1)

            output[i] = x

        return detections
                    
        # load_weights, which loads the pre-trained parameters, was still missing here... got lazy... wrote it the next day
    # OK, I'm back...
    def load_weights(self, weightfile):
        fp = open(weightfile, "rb")

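        # Read the first 5 values, which form the header:
        # 1. Major version number
        # 2. Minor version number
        # 3. Subversion number
        # 4, 5. Number of images seen by the network during training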
        header = np.fromfile(fp, dtype = np.int32, count = 5)
        self.header = torch.from_numpy(header)
        self.seen = self.header[3]

        # The rest of the file holds the weights
        weights = np.fromfile(fp, dtype = np.float32)
        fp.close()

        ptr = 0
        for i in range(len(self.module_list)):
            # blocks[0] is the [net] info block, hence the i + 1.
            module_type = self.blocks[i + 1]["type"]
            if module_type == "convolutional":
                model = self.module_list[i]
                # Check for batch norm, which decides whether the conv layer has a bias: with batch norm there is no conv bias, and vice versa.
                try:
                    batch_normalize = int(self.blocks[i+1]["batch_normalize"])
                except:
                    batch_normalize = 0   
                conv = model[0]
                if(batch_normalize):
                    bn = model[1]
                    
                    # Number of batch-norm biases
                    num_bn_biases = bn.bias.numel()
                    # torch.from_numpy(ndarray) shares memory with the ndarray instead of copying it.
                    bn_biases = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    # Reshape: .view_as(other) gives the tensor the same shape as other.
                    bn_biases = bn_biases.view_as(bn.bias.data)
                    bn_weights = bn_weights.view_as(bn.weight.data)
                    bn_running_mean = bn_running_mean.view_as(bn.running_mean)
                    bn_running_var = bn_running_var.view_as(bn.running_var)
                    
                    # Copy the parameters into the model.
                    bn.bias.data.copy_(bn_biases)
                    bn.weight.data.copy_(bn_weights)
                    bn.running_mean.copy_(bn_running_mean)
                    bn.running_var.copy_(bn_running_var)
                    
                else:
                    # Conv bias
                    num_biases = conv.bias.numel()
                    
                    conv_biases = torch.from_numpy(weights[ptr : ptr + num_biases])
                    ptr += num_biases
                    
                    conv_biases = conv_biases.view_as(conv.bias.data)
                    
                    conv.bias.data.copy_(conv_biases)
                    
                # Conv weights
                num_weights = conv.weight.numel()

                conv_weights = torch.from_numpy(weights[ptr: ptr + num_weights])
                ptr = ptr + num_weights
                
                conv_weights = conv_weights.view_as(conv.weight.data)

                conv.weight.data.copy_(conv_weights)
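The pointer arithmetic in `load_weights` is easier to trust once you count how many floats each convolutional block consumes. The helper below is a sketch for that sanity check only. The function name is made up, and the two example layers (3 to 32 channels with a 3 x 3 kernel, and 1024 to 255 channels with a 1 x 1 kernel) are assumed here to match the first convolution and a detection convolution of the stock `yolov3.cfg`.

```python
def conv_block_float_count(in_channels, filters, kernel_size, batch_normalize):
    """Number of float32 values load_weights reads for one convolutional block."""
    if batch_normalize:
        # BN bias, BN weight, running mean, running variance: one value per filter each
        per_filter = 4 * filters
    else:
        # plain convolution bias
        per_filter = filters
    kernel = filters * in_channels * kernel_size * kernel_size
    return per_filter + kernel


first_layer = conv_block_float_count(3, 32, 3, batch_normalize=True)
detection_layer = conv_block_float_count(1024, 255, 1, batch_normalize=False)
print(first_layer)      # 992
print(detection_layer)  # 261375
```

If the sum of these counts over all convolutional blocks does not equal the length of the `weights` array, then the cfg file and the weights file do not belong together.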

Building the detector:

from __future__ import division
import time
import torch
import torch.nn as nn 
from torch.autograd import Variable
import numpy as np
import cv2
from Customize import *
import argparse
import os
import os.path as osp
from DarkNet53 import Darknet
import pickle as pkl
import pandas as pd
import random


def arg_parse():
    parser = argparse.ArgumentParser(description='YOLO v3 Detection Module')
    
    parser.add_argument("--images", dest = 'images', help = 
                        "Image / Directory containing images to perform detection upon",
                       default = "C:\\YOLOv3\\YOLO_v3_tutorial_from_scratch-master\\imgs", type = str)
    parser.add_argument("--det", dest = 'det', help = 
                        "Image / Directory to store detections to",
                       default = "C:\\YOLOv3\\YOLO_v3_tutorial_from_scratch-master\\test_results", type = str)
    parser.add_argument("--bs", dest = 'bs', help = "Batch size", default = 1)
    parser.add_argument("--confidence", dest = "confidence", help = "Object Confidence to filter predictions", default = 0.5)
    parser.add_argument("--nms_thresh", dest = "nms_thresh", help = "NMS Threshhold", default = 0.4)
    parser.add_argument("--cfg", dest = 'cfgfile', help =
                       "Config file",default = "C:\\YOLOv3\\YOLO_v3_tutorial_from_scratch-master\\cfg\\yolov3.cfg", type = str)
    parser.add_argument("--weights", dest = 'weightsfile', help =
                       "weightsfile",
                       default = "C:\\YOLOv3\\YOLO_v3_tutorial_from_scratch-master\\yolov3.weights", type = str)
    parser.add_argument("--reso", dest = 'reso', help = 
                       "Input resolution of the network. Increase to increase accuracy. Decrease to increase speed",
                       default = "416", type = str)
    
    return parser.parse_args()
    
args = arg_parse()
images = args.images
batch_size = int(args.bs)
confidence = float(args.confidence)
nms_thesh = float(args.nms_thresh)
start =0
CUDA =torch.cuda.is_available()
                        
num_classes = 80
classes = load_classes(r"C:\YOLOv3\YOLO_v3_tutorial_from_scratch-master\data\coco.names")

print(29*"-"+"Loading Network..."+29*"-")
model = Darknet(args.cfgfile)
model.load_weights(args.weightsfile)
print(29*"-"+"Loading Finished"+29*"-")

model.net_info["height"] = args.reso
inp_dim = int(model.net_info["height"])
# Fail early on an invalid resolution
assert inp_dim % 32 == 0
assert inp_dim > 32

if CUDA:
    model.cuda()
    
# Evaluation mode, used at test time
model.eval()

read_dir = time.time()

try:
    imlist = [osp.join(osp.realpath('.'), images, img) for img in os.listdir(images)]
except NotADirectoryError:
    imlist = []
    imlist.append(osp.join(osp.realpath('.'), images))
except FileNotFoundError:
    print("No file or directory with the name {}".format(images))
    exit()

if not os.path.exists(args.det):
    os.makedirs(args.det)
    
load_batch = time.time()
loaded_ims = [cv2.imread(x) for x in imlist]


im_batches = list(map(prep_image, loaded_ims, [inp_dim for x in range(len(imlist))]))

# Image dimensions
im_dim_list = [(x.shape[1], x.shape[0]) for x in loaded_ims]
im_dim_list = torch.FloatTensor(im_dim_list).repeat(1, 2)

   
if CUDA:
    im_dim_list = im_dim_list.cuda()
    
leftover = 0    
if (len(im_dim_list) % batch_size):
    leftover = 1
    
if batch_size != 1:
    num_batches = len(imlist) // batch_size + leftover
    im_batches = [torch.cat((im_batches[i*batch_size : min((i + 1)*batch_size, 
                    len(im_batches))])) for i in range(num_batches)]

write = 0
start_det_loop = time.time()
for i, batch in enumerate(im_batches):
#     print(i)
    start = time.time()
    if CUDA:
        batch = batch.cuda()
    with torch.no_grad():
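        # The output is the full pile of candidate boxes with their predictions.
        prediction = model(Variable(batch), CUDA)
    # Note: write_results returns the int 0 when nothing is detected.
    # Inside write_results: best class per box (the classification result) and NMS.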
    prediction = write_results(prediction, confidence, num_classes, nms_conf = nms_thesh)
    
    end = time.time()
#     print(type(prediction))
#     print(prediction)
    if type(prediction) == int:
        
        for im_num, image in enumerate(imlist[i*batch_size: min((i + 1)*batch_size, len(imlist))]):
            # Image index
            im_id = i*batch_size + im_num
#             print(im_id)
            print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("\\")[-1], (end - start)/batch_size))
            print("{0:20s} {1:s}".format("Object Detected:", ""))
            print(60*"-")
        continue
        
    prediction[:,0] += i*batch_size
#     print(prediction)
    
    if not write:
        output = prediction
        write = 1
    else:
        output = torch.cat((output, prediction))
        
    for in_num, image in enumerate(imlist[i*batch_size: min((i + 1)*batch_size, len(imlist))]):
        im_id = i*batch_size + in_num
#         print(im_id)
        objs = [classes[int(x[-1])] for x in output if int(x[0]) == im_id]
        print(objs)
        print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("\\")[-1], (end - start)/batch_size))
        print("{0:20s} {1:s}".format("Object Detected", " ".join(objs)))
        print(60*"-")
        
    if CUDA:
        torch.cuda.synchronize()
# print(output)
try:
    output
except NameError:
    print ("No detections were made")
    exit()        
        
im_dim_list = torch.index_select(im_dim_list, 0, output[:,0].long())

scaling_factor = torch.min(inp_dim/im_dim_list, 1)[0].view(-1, 1)
# Map the boxes from the letterboxed network input back to the original image
output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim_list[:,0].view(-1, 1))/2
output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim_list[:,1].view(-1, 1))/2

output[:, 1:5] /= scaling_factor

for i in range(output.shape[0]):
    output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim_list[i, 0])
    output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim_list[i, 1])

output_recast = time.time()
class_load = time.time()
colors = pkl.load(open(r"C:\YOLOv3\YOLO_v3_tutorial_from_scratch-master\pallete", "rb"))

draw = time.time()
# print(output)
# print(loaded_ims)

def write_so(x, results):
    # Recent OpenCV versions expect plain Python ints for the corner points
    c1 = tuple(int(v) for v in x[1:3])
    c2 = tuple(int(v) for v in x[3:5])
    img = results[int(x[0])]
    cls = int(x[-1])
    color = random.choice(colors)
    label = "{0}".format(classes[cls])
    cv2.rectangle(img, c1, c2, color, 1)
    t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0]
    c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4
    cv2.rectangle(img, c1, c2,color, -1)
    cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1,
                [225,255,255], 1)
    return img

list(map(lambda x: write_so(x, loaded_ims), output))
det_names = pd.Series(imlist).apply(lambda x: "{}\\det_{}".format(args.det, x.split("\\")[-1]))
list(map(cv2.imwrite, det_names, loaded_ims))
end = time.time()

print("SUMMARY")
print("----------------------------------------------------------")
print("{:25s}: {}".format("Task", "Time Taken (in seconds)"))
print()
print("{:25s}: {:2.3f}".format("Reading addresses", load_batch - read_dir))
print("{:25s}: {:2.3f}".format("Loading batch", start_det_loop - load_batch))
print("{:25s}: {:2.3f}".format("Detection (" + str(len(imlist)) +  " images)", output_recast - start_det_loop))
print("{:25s}: {:2.3f}".format("Output Processing", class_load - output_recast))
print("{:25s}: {:2.3f}".format("Drawing Boxes", end - draw))
print("{:25s}: {:2.3f}".format("Average time_per_img", (end - load_batch)/len(imlist)))
print("----------------------------------------------------------")


torch.cuda.empty_cache()
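The coordinate conversion near the end of the detector (subtract the padding, divide by the scaling factor, clamp) is the inverse of the letterbox step, and it is worth checking on a single box. The following plain-Python sketch mirrors those three tensor lines under the assumption of a square network input; `to_original_coords` and the example numbers are made up for illustration.

```python
def to_original_coords(box, img_w, img_h, inp_dim):
    """Undo the letterbox. box is (x1, y1, x2, y2) in network-input pixels."""
    scale = min(inp_dim / img_w, inp_dim / img_h)
    pad_x = (inp_dim - scale * img_w) / 2  # gray border left and right
    pad_y = (inp_dim - scale * img_h) / 2  # gray border top and bottom
    x1, y1, x2, y2 = box
    restored = [(x1 - pad_x) / scale, (y1 - pad_y) / scale,
                (x2 - pad_x) / scale, (y2 - pad_y) / scale]
    # Clamp to the image borders, as the detector does with torch.clamp.
    limits = [img_w, img_h, img_w, img_h]
    return [min(max(v, 0.0), lim) for v, lim in zip(restored, limits)]


# A hypothetical 832 x 624 photo and a box predicted on the 416 x 416 input.
restored = to_original_coords((104, 104, 312, 260), 832, 624, 416)
print(restored)  # [208.0, 104.0, 624.0, 416.0]
```

The photo was scaled by 0.5 and padded by 52 pixels at the top and bottom, so the y coordinates lose 52 before everything is doubled. Skipping the padding step is a typical way to end up with boxes that sit too low.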