YOLO v3

xiaochengJF

Article link: https://blog.csdn.net/weixin_43711554/article/details/90452706

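This article gives an in-depth analysis of the YOLOv3 object detection algorithm, covering its Darknet-53 network structure, the multi-scale feature detection mechanism, the revised loss function, and how anchors are assigned across scales. It also explains the advantage of switching object classification from softmax to logistic outputs, and how the code handles key steps such as locating predicted boxes and NMS.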

Paper: YOLOv3: An Incremental Improvement (arXiv technical report, 2018)
Code: eriklindernoren/PyTorch-YOLOv3
Jupyter notebook walking through the code: YOLOv3 Darknet

Darknet-53

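YOLOv3 uses Darknet-53 as its backbone (53 convolutional layers). Borrowing from residual networks, it adds shortcut connections between some layers. It drops the max-pooling used in YOLOv2 and instead reduces feature-map resolution by increasing the stride of convolutional layers. The network architecture is shown below:
[Figure: Darknet-53 network architecture]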
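To make the stride-based downsampling concrete, here is a minimal sketch of how the feature-map size shrinks. It assumes a 416x416 input and five 3x3 convolutions with stride 2 and padding 1; the function names are hypothetical and only illustrate the arithmetic, not the repository's code.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_sizes(input_size=416, num_stride2_convs=5):
    """Feature-map side length after each stride-2 convolution."""
    sizes = [input_size]
    for _ in range(num_stride2_convs):
        sizes.append(conv_out_size(sizes[-1], kernel=3, stride=2, padding=1))
    return sizes


sizes = downsample_sizes(416)
print(sizes)  # [416, 208, 104, 52, 26, 13]
```

Under these assumptions, the last three sizes (52, 26, 13) correspond to overall strides of 8, 16, and 32.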

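Multi-scale feature detection

Unlike the passthrough structure in YOLOv2, YOLOv3 goes a step further and detects objects on feature maps at 3 different scales. Each of the three outputs has size $S \times S \times 3 \times (box_{atr} + class_{num})$, where $S$ is the side length of the feature map, 3 is the number of anchors per grid cell, $box_{atr}$ is the number of box attributes (4 coordinates plus 1 objectness confidence), and $class_{num}$ is the number of classes. The details can be seen in the two figures below (their source should be among the references).
[Figures: YOLOv3 multi-scale detection structure]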
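As a rough illustration of the output-size formula, the sketch below computes the three output shapes. The 416x416 input, the 80 classes, and the strides 32, 16, and 8 are assumptions chosen to match the 13x13 examples used elsewhere in this post; `yolo_output_shapes` is a hypothetical helper.

```python
def yolo_output_shapes(img_size=416, num_classes=80, strides=(32, 16, 8),
                       anchors_per_scale=3):
    """Return (S, S, anchors, box_attrs + num_classes) for each scale."""
    box_attrs = 5  # x, y, w, h, objectness confidence
    shapes = []
    for stride in strides:
        s = img_size // stride
        shapes.append((s, s, anchors_per_scale, box_attrs + num_classes))
    return shapes


shapes = yolo_output_shapes()
# Total number of predicted boxes across the three scales
total_boxes = sum(h * w * a for h, w, a, _ in shapes)
print(shapes)       # [(13, 13, 3, 85), (26, 26, 3, 85), (52, 52, 3, 85)]
print(total_boxes)  # 10647
```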

Composition of the loss

Once you understand how the targets are built, the composition of the loss becomes clear.

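import math

import numpy as np
import torch

# Note: bbox_iou (used below) is an IoU helper assumed to be defined in the same module.


def build_targets(pred_boxes, target, anchors, num_anchors, num_classes, dim, ignore_thres, img_dim):
    nB = target.size(0)  # batch size, e.g. 16
    nA = num_anchors     # number of anchors per scale, 3
    nC = num_classes     # number of classes in the dataset, e.g. 80
    dim = dim            # side length of the feature map, e.g. 13 (not the downsampling factor)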

    # Initialize the target tensors
    mask        = torch.zeros(nB, nA, dim, dim)     # [16,3,13,13], all zeros
    conf_mask   = torch.ones(nB, nA, dim, dim)      # [16,3,13,13], all ones
    tx          = torch.zeros(nB, nA, dim, dim)     # [16,3,13,13], all zeros
    ty          = torch.zeros(nB, nA, dim, dim)     # [16,3,13,13], all zeros
    tw          = torch.zeros(nB, nA, dim, dim)     # [16,3,13,13], all zeros
    th          = torch.zeros(nB, nA, dim, dim)     # [16,3,13,13], all zeros
    tconf       = torch.zeros(nB, nA, dim, dim)     # [16,3,13,13], all zeros
    tcls        = torch.zeros(nB, nA, dim, dim, num_classes)    # [16,3,13,13,80], all zeros

    # Counters used to compute recall over the batch
    nGT = 0  # number of ground-truth (GT) boxes
    nCorrect = 0  # number of correct predictions: the predicted box of the best-matching anchor has IoU > 0.5 with the GT box

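    # Loop over every image in the batch
    for b in range(nB):
        # Loop over every object in the image
        for t in range(target.shape[1]):
            if target[b, t].sum() == 0:
                # An all-zero row is padding, so skip it (if padding rows only appear after
                # all real objects, `break` would end the loop earlier)
                continue
            nGT += 1
            # Convert to position relative to box
            # GT coordinates in target ([16,50,5]) are normalized to 0-1, so multiplying by dim
            # converts them to the 13x13 grid scale
            gx = target[b, t, 1] * dim
            gy = target[b, t, 2] * dim
            gw = target[b, t, 3] * dim
            gh = target[b, t, 4] * dim
            # Get grid box indices: round down to get the cell index, i.e. the top-left offset of the cell
            gi = int(gx)
            gj = int(gy)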
            # Get shape of gt box [1,4]
            gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)
            # Get shape of anchor box [3,4]: the first two columns are 0, the last two are the w, h of the 3 anchors
            anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((len(anchors), 2)), np.array(anchors)), 1))
            # Calculate iou between gt and anchor shapes
            # (IoU between one GT box and the 3 anchors, comparing shapes only)
            anch_ious = bbox_iou(gt_box, anchor_shapes)
            # Where the overlap is larger than threshold set mask to zero (ignore)
            # conf_mask starts as all ones [16,3,13,13]. The boolean index selects along the anchor
            # dimension, so as written every position of each anchor whose shape IoU with the GT box
            # exceeds ignore_thres is set to 0 and excluded from the confidence loss. The responsible
            # cell is set back to 1 further down.
            conf_mask[b, anch_ious > ignore_thres] = 0  # anchors below the threshold (0.5) are treated as background
            # Find the best matching anchor box: index of the anchor with the largest IoU with the GT box
            best_n = np.argmax(anch_ious)
            # Get ground truth box [1,4]
            gt_box = torch.FloatTensor(np.array([gx, gy, gw, gh])).unsqueeze(0)
            # Get the best prediction [1,4]
            # pred_boxes: predicted boxes on the 13x13 scale
            # pred_box: the predicted box of the anchor that has the largest IoU with the GT box
            pred_box = pred_boxes[b, best_n, gj, gi].unsqueeze(0)
            # Masks: [16,3,13,13], all zeros initially. Set the position of the responsible prediction
            # (best-matching anchor, cell containing the GT center) to 1
            mask[b, best_n, gj, gi] = 1
            # conf_mask [16,3,13,13] started as all ones; ignored positions were set to 0 above, and the
            # responsible position is now set back to 1.
            # Overall idea: responsible position = 1, ignored positions = 0, loosely similar to NMS.
            # Background + best-IoU box play the roles of negative + positive samples in an RPN,
            # since both are counted in the loss
            conf_mask[b, best_n, gj, gi] = 1
            # Coordinates: gi is gx rounded down, so gx - gi and gy - gj are the object-center
            # coordinates within the cell (between 0 and 1)
            # tx, ty start as all zeros; the true center offsets are written at the responsible position
            tx[b, best_n, gj, gi] = gx - gi
            ty[b, best_n, gj, gi] = gy - gj
            # Width and height
            # In the paper, GT size on the 13x13 scale = anchor size * exp(predicted value),
            # so the target value = log(GT size / anchor size + 1e-16)
            tw[b, best_n, gj, gi] = math.log(gw/anchors[best_n][0] + 1e-16)
            th[b, best_n, gj, gi] = math.log(gh/anchors[best_n][1] + 1e-16)
            # One-hot encoding of label
            tcls[b, best_n, gj, gi, int(target[b, t, 0])] = 1
            # Calculate iou between ground truth and best matching prediction
            # (the predicted box of the anchor that best matches the GT box)
            iou = bbox_iou(gt_box, pred_box, x1y1x2y2=False)
            # tconf [16,3,13,13] starts as all zeros; 1 marks the position whose cell contains the
            # object center and is therefore responsible for predicting the object
            tconf[b, best_n, gj, gi] = 1

            if iou > 0.5:
                nCorrect += 1
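    # nGT: number of GT boxes in the batch
    # nCorrect: number of GT boxes in the batch whose responsible prediction has IoU > 0.5
    # mask [16,3,13,13]: starts as all zeros; 1 at the position of the prediction whose anchor best matches a GT box
    # conf_mask [16,3,13,13]: starts as all ones; responsible positions are 1, ignored positions are 0
    # tx, ty [16,3,13,13]: start as all zeros; hold the true center offsets at responsible positions
    # tw, th [16,3,13,13]: start as all zeros; hold the GT w, h converted by the formula into the values the network should output
    # tconf [16,3,13,13]: starts as all zeros; 1 where the object center falls in the cell, which is responsible for predicting it
    # tcls [16,3,13,13,80]: starts as all zeros; one-hot encoded, 1 at the true class
    return nGT, nCorrect, mask, conf_mask, tx, ty, tw, th, tconf, tcls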
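The target encoding for a single ground-truth box can be traced by hand with a small pure-Python sketch. It is a simplified illustration rather than the repository's implementation: `shape_iou` and `encode_target` are hypothetical helpers, and the box and the stride-32 anchors are example values.

```python
import math


def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same top-left corner (shape only)."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


def encode_target(box, anchors, dim):
    """box: (cx, cy, w, h) normalized to [0, 1]; anchors: (w, h) in grid units."""
    gx, gy, gw, gh = (v * dim for v in box)   # scale to the dim x dim grid
    gi, gj = int(gx), int(gy)                 # cell containing the box center
    ious = [shape_iou(gw, gh, aw, ah) for aw, ah in anchors]
    best_n = max(range(len(anchors)), key=lambda n: ious[n])
    aw, ah = anchors[best_n]
    return {
        "cell": (gj, gi),
        "best_n": best_n,
        "tx": gx - gi,                        # center offset inside the cell
        "ty": gy - gj,
        "tw": math.log(gw / aw + 1e-16),      # log-ratio to the anchor size
        "th": math.log(gh / ah + 1e-16),
    }


stride = 32
anchors = [(w / stride, h / stride) for w, h in [(116, 90), (156, 198), (373, 326)]]
t = encode_target((0.5, 0.5, 0.4, 0.6), anchors, dim=13)
print(t["cell"], t["best_n"], t["tx"], t["ty"])  # (6, 6) 1 0.5 0.5
```

For this example box the second anchor has the most similar shape, so only position (anchor 1, row 6, column 6) would receive targets.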

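The targets show that the YOLOv3 loss can still be broadly divided into two parts:

Boxes predicted to contain an object (the best IoU match for a gt_box): the losses on tx, ty, tw, th, tconf, and tcls.
Background boxes (IoU with the gt_box below the threshold): predicting an object here means predicting one where none exists, so their confidence must count toward the loss. The code merges the object and no-object confidence positions into a single conf_mask.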
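One way to picture how the two masks select loss terms is the sketch below. The choice of mean squared error for coordinates and binary cross-entropy for confidence, the averaging, and the helper names `masked_mse` and `masked_bce` are all assumptions made for illustration; the repository's actual loss may weight or combine terms differently.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over the positions where mask is 1."""
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


def masked_bce(pred, target, mask, eps=1e-12):
    """Binary cross-entropy over the positions where mask is 1 (pred in (0, 1))."""
    terms = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
             for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


# Four flattened positions: index 0 is responsible for an object,
# index 2 is ignored (conf_mask = 0), indices 1 and 3 are background.
mask = [1, 0, 0, 0]
conf_mask = [1, 1, 0, 1]
tx, x = [0.5, 0.0, 0.0, 0.0], [0.6, 0.3, 0.5, 0.1]
tconf, conf = [1, 0, 0, 0], [0.9, 0.2, 0.8, 0.1]

loss_x = masked_mse(x, tx, mask)                # only the responsible position
loss_conf = masked_bce(conf, tconf, conf_mask)  # responsible + background positions
print(loss_x, loss_conf)
```

The ignored position contributes nothing to `loss_conf`, whatever confidence the network predicts there.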

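Additional details

Object classification: softmax replaced with logistic

Class prediction no longer uses softmax; it uses independent logistic outputs instead. This supports multi-label objects (for example, a person labeled both Woman and Person).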
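The difference is easy to see numerically. The sketch below uses made-up logits for three hypothetical classes: independent logistic (sigmoid) scores let two labels pass a 0.5 threshold at once, while softmax scores compete and sum to 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Person", "Woman", "Car"]
logits = [2.0, 1.5, -3.0]  # hypothetical raw class scores for one box

logistic_scores = [sigmoid(z) for z in logits]
softmax_scores = softmax(logits)

threshold = 0.5
multi_labels = [l for l, s in zip(labels, logistic_scores) if s > threshold]
softmax_labels = [l for l, s in zip(labels, softmax_scores) if s > threshold]

print(multi_labels)    # ['Person', 'Woman']
print(softmax_labels)  # ['Person']
```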

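Anchor assignment

The 9 anchors obtained by clustering are split evenly among the feature layers at the three scales. Deeper feature maps are smaller and have larger receptive fields, so they are assigned the larger anchors. The mask in darknet is the index of the anchor.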

Feature map	Feature map size	Anchor sizes (w, h)	Number of anchors
Feature 1	$13\times 13$	[116,90], [156,198], [373,326]	$13\times 13\times 3$
Feature 2	$26\times 26$	[30,61], [62,45], [59,119]	$26\times 26\times 3$
Feature 3	$52\times 52$	[10,13], [16,30], [33,23]	$52\times 52\times 3$
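The mask-based lookup can be sketched in a few lines. The anchor values and the mask indices below follow the standard darknet yolov3.cfg as I recall it; the dictionary keyed by stride and the helper `anchors_for_stride` are hypothetical conveniences, not darknet's own data structures.

```python
# The 9 clustered anchors (w, h) in input-image pixels, smallest to largest
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

# Anchor indices ("mask") used by each detection scale, keyed by stride
MASKS = {32: [6, 7, 8],   # 13x13 map: largest anchors
         16: [3, 4, 5],   # 26x26 map: medium anchors
         8:  [0, 1, 2]}   # 52x52 map: smallest anchors


def anchors_for_stride(stride):
    """Return the 3 anchors assigned to the scale with the given stride."""
    return [ANCHORS[i] for i in MASKS[stride]]


for stride in (32, 16, 8):
    print(416 // stride, anchors_for_stride(stride))
```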

Dividing the grid and setting 3 anchors for each cell

# grid_x and grid_y give the top-left coordinates of each cell on the feature map
grid_x = torch.linspace(0, g_dim-1, g_dim).repeat(g_dim,1).repeat(bs*self.num_anchors, 1, 1).view(x.shape).type(FloatTensor)    # [16,3,13,13]; each row holds 0-12, 13 rows in total
grid_y = torch.linspace(0, g_dim-1, g_dim).repeat(g_dim,1).t().repeat(bs*self.num_anchors, 1, 1).view(y.shape).type(FloatTensor)  # [16,3,13,13]; each column holds 0-12, 13 columns in total (because of the transpose .t())
scaled_anchors = [(a_w / stride, a_h / stride) for a_w, a_h in self.anchors]  # scale the anchors from original-image pixels down to the feature-map scale
anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))  # [3,1]: w of the 3 anchors
anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))  # [3,1]: h of the 3 anchors
anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, g_dim*g_dim).view(w.shape) # [16,3,13,13]
anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, g_dim*g_dim).view(h.shape) # [16,3,13,13]

Locating the predicted boxes on the gridded map

# Add offset and scale with anchors
pred_boxes = FloatTensor(prediction[..., :4].shape)  # allocate a new tensor [16,3,13,13,4]
# pred_boxes holds the predicted boxes on the 13x13 feature-map scale
# x, y are predictions (coordinates within a cell, between 0 and 1 after sigmoid); grid_x, grid_y are the top-left offsets of the cells (values 0-12)
pred_boxes[..., 0] = x.data + grid_x
pred_boxes[..., 1] = y.data + grid_y
# w, h are predictions, i.e. log-scale factors relative to the anchors; anchor_w, anchor_h are the 3 anchors of each cell
pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
pred_boxes[..., 3] = torch.exp(h.data) * anchor_h
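For a single cell the decoding reduces to a few scalar operations. This is a hedged sketch, not the repository's code: it applies the sigmoid inside a hypothetical `decode_box` helper (the snippet above receives x and y already passed through the sigmoid), and it multiplies by the stride to return pixel coordinates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Turn raw outputs for one cell and anchor into a box in input-image pixels.

    (cx, cy) is the top-left index of the cell; anchor_w and anchor_h are in grid units.
    """
    bx = sigmoid(tx) + cx          # center x on the grid
    by = sigmoid(ty) + cy          # center y on the grid
    bw = anchor_w * math.exp(tw)   # width on the grid
    bh = anchor_h * math.exp(th)   # height on the grid
    return tuple(v * stride for v in (bx, by, bw, bh))


# All-zero raw outputs: the box sits at the cell center and has the anchor's size
box = decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6,
                 anchor_w=156 / 32, anchor_h=198 / 32, stride=32)
print(box)  # (208.0, 208.0, 156.0, 198.0)
```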

NMS performed separately for each class

def non_max_suppression(prediction, num_classes, conf_thres=0.5, nms_thres=0.4):
    """
    Removes detections with lower object confidence score than 'conf_thres' and performs
    Non-Maximum Suppression to further filter detections.
    Returns detections with shape:
        (x1, y1, x2, y2, object_conf, class_score, class_pred)
    """

    # From (center x, center y, width, height) to (x1, y1, x2, y2)
    box_corner = prediction.new(prediction.shape)
    box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
    box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
    box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
    box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
    prediction[:, :, :4] = box_corner[:, :, :4]

    output = [None for _ in range(len(prediction))]
    for image_i, image_pred in enumerate(prediction):
        # Filter out confidence scores below threshold
        conf_mask = (image_pred[:, 4] >= conf_thres).squeeze()
        image_pred = image_pred[conf_mask]
        # If none are remaining => process next image
        if not image_pred.size(0):
            continue
        # Get score and class with highest confidence
        class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1,  keepdim=True)
        # Detections ordered as (x1, y1, x2, y2, obj_conf, class_conf, class_pred)
        detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)
        # Iterate through all predicted classes
        unique_labels = detections[:, -1].cpu().unique()
        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()
        for c in unique_labels:
            # Get the detections with the particular class
            detections_class = detections[detections[:, -1] == c]
            # Sort the detections by maximum objectness confidence
            _, conf_sort_index = torch.sort(detections_class[:, 4], descending=True)
            detections_class = detections_class[conf_sort_index]
            # Perform non-maximum suppression
            max_detections = []
            while detections_class.size(0):
                # Get detection with highest confidence and save as max detection
                max_detections.append(detections_class[0].unsqueeze(0))
                # Stop if we're at the last detection
                if len(detections_class) == 1:
                    break
                # Get the IOUs for all boxes with lower confidence
                ious = bbox_iou(max_detections[-1], detections_class[1:])
                # Remove detections with IoU >= NMS threshold
                detections_class = detections_class[1:][ious < nms_thres]

            max_detections = torch.cat(max_detections).data
            # Add max detections to outputs
            output[image_i] = max_detections if output[image_i] is None else torch.cat((output[image_i], max_detections))
    return output
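The same greedy per-class procedure can be written without tensors, which makes the control flow easier to follow. The sketch below is a simplified stand-in under stated assumptions: the `iou` and `nms_per_class` helpers and the example detections are hypothetical, and each detection carries a single score rather than separate objectness and class scores.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_per_class(detections, nms_thres=0.4):
    """detections: list of (x1, y1, x2, y2, score, class_id) tuples."""
    kept = []
    for c in sorted({d[5] for d in detections}):
        # Candidates of this class, highest score first
        cands = sorted((d for d in detections if d[5] == c),
                       key=lambda d: d[4], reverse=True)
        while cands:
            best = cands.pop(0)
            kept.append(best)
            # Drop remaining boxes that overlap the kept box too much
            cands = [d for d in cands if iou(best, d) < nms_thres]
    return kept


dets = [
    (10, 10, 110, 110, 0.9, 0),
    (12, 12, 112, 112, 0.8, 0),    # heavy overlap with the first box, same class
    (10, 10, 110, 110, 0.7, 1),    # same location, different class
    (200, 200, 260, 260, 0.6, 0),  # far away from the others
]
kept = nms_per_class(dets)
print(len(kept))  # 3
```

Because suppression runs separately for each class, the class-1 box survives even though it coincides with a higher-scoring class-0 box.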

Code issues

Problems encountered when running on Colab
Problem 1
TypeError: 'NoneType' object is not subscriptable
Fix:
Line 128 of datasets.py

if self.augment: 
Change to:
if self.augment and targets is not None:

 del checkpoint

PyTorch: reducing GPU memory consumption and optimizing memory usage to avoid out-of-memory errors

torch.backends.cudnn.benchmark = True

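Setting this flag lets cuDNN's built-in auto-tuner search for the most efficient algorithms for the current configuration, which improves runtime performance.
Note: 1) if the dimensions and types of the network's input data change little, this can improve performance; 2) if the input data changes on every iteration, cuDNN has to search for the best configuration each time, which reduces performance instead.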