[Object Detection] YOLO v3

可乐大牛

Last modified 2022-11-22 10:11:03

First published 2022-04-25 14:56:33

Article link: https://blog.csdn.net/qq_44173974/article/details/124402679

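1. A very accessible walkthrough of the YOLO series

Link

2. Ideas of the paper

Overview

YOLOv3 is the third version of the single-stage object detector YOLO and is widely used in industry. Compared with YOLO v2 it changes the backbone network, the selection of positive and negative samples, and the loss function, and it adds FPN-style multi-scale prediction on a feature pyramid. The main gain is accuracy, particularly on small objects, while the detector stays fast enough for real-time use.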

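Improvements

Backbone: Darknet-53, which is essentially Darknet-19 from YOLO v2 combined with ResNet-style residual connections. Darknet-53 has 52 convolutional layers plus 1 fully connected layer. The network is first trained as a classifier (on ImageNet in the paper, not on COCO); the final fully connected layer is then dropped and the rest is used inside YOLO as a fully convolutional backbone, so it accepts inputs of different sizes (in practice a multiple of 32). It is built from many residual blocks, each made of a 1×1 and a 3×3 convolution. Following ResNet, shortcut connections skip across layers, which eases the optimization problems that appear as a network gets deeper and also lets later layers use more of the fine-grained image information from the shallow layers.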

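Positive and negative sample selection: YOLO v3 sorts the predicted boxes into three groups: positive samples, negative samples, and samples that take no part in the loss. Positive: for each ground truth, compute the IoU against every predicted box and take the box with the highest IoU as the positive sample. The number of boxes depends on the input size: a 256×256 input gives (8² + 16² + 32²) × 3 = 4032 boxes, and the 416×416 input used in the code below gives (13² + 26² + 52²) × 3 = 10647. Each predicted box can be assigned to only one ground truth. For example, once the first ground truth has been matched to its positive box, the next ground truth looks for its highest-IoU box among the remaining ones. Negative: any non-positive box whose IoU with every ground truth is below the threshold (0.5). Ignored: any non-positive box whose IoU with at least one ground truth is above the threshold (0.5 in the paper).
Note: the confidence label is 1 for positive samples and 0 for negative samples.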
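The three-way split described above can be sketched in plain Python. This is only an illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the helper names `iou_xyxy` and `assign_samples` are made up for this example rather than taken from the paper or from the code later in this post.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_samples(pred_boxes, gt_boxes, ignore_thresh=0.5):
    """Label every predicted box as 'positive', 'negative' or 'ignored'."""
    labels = ["negative"] * len(pred_boxes)
    taken = set()  # a predicted box can serve only one ground truth
    for gt in gt_boxes:
        best, best_iou = None, -1.0
        for i, pred in enumerate(pred_boxes):
            if i in taken:
                continue
            value = iou_xyxy(pred, gt)
            if value > best_iou:
                best, best_iou = i, value
        if best is not None:
            taken.add(best)
            labels[best] = "positive"
    # Non-positive boxes that still overlap some ground truth well are ignored
    for i, pred in enumerate(pred_boxes):
        if i not in taken and any(iou_xyxy(pred, gt) > ignore_thresh for gt in gt_boxes):
            labels[i] = "ignored"
    return labels


preds = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10)]
print(assign_samples(preds, gts))
```

Here the first box matches the ground truth exactly, the second overlaps it with IoU 0.81 and is therefore ignored, and the third does not overlap at all.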

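Loss function: for positive samples the loss includes the center localization error, the width/height localization error, the confidence error and the classification error; for negative samples it includes only the confidence error. The bounding-box terms use mean squared error, while the confidence and class terms all use binary cross-entropy. For confidence, the closer the prediction for a positive sample is to 1, the smaller the loss. For classes, every class is treated as its own binary classification and the per-class cross-entropy terms are summed, i.e. a sum over classes of −[y·log p + (1 − y)·log(1 − p)], where y is the 0/1 label and p the predicted probability.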
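As a small worked example of the per-class binary cross-entropy described above, the sketch below evaluates it with the standard library only. The function names `bce` and `class_loss` and the clamping constant `eps` are assumptions made for this illustration.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1 - eps)  # keep log() away from 0
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def class_loss(probs, onehot):
    """Sum of independent per-class binary cross-entropy terms."""
    return sum(bce(p, y) for p, y in zip(probs, onehot))


# Confidence: the closer the prediction is to 1 for a positive sample, the smaller the loss
print(bce(0.6, 1), bce(0.99, 1))
# Classes: two independent binary problems, summed
print(class_loss([0.9, 0.1], [1, 0]))
```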

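FPN-style multi-scale detection: the backbone downsamples the input image several times, producing feature maps of different sizes. A deeper feature map is then upsampled and concatenated with the shallower feature map of the same height and width. The resulting feature map carries both the semantic features of the deep layers and the fine-grained features of the shallow layers, which makes the model better at predicting small and densely packed objects. "Multi-scale" means that the image is turned into feature maps at three scales, used to detect large, medium and small objects respectively, with classification and box regression carried out on each. This markedly improves detection accuracy over the original YOLO network.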
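To make the three scales concrete, the sketch below computes the shape of each prediction map and the total number of predicted boxes. It assumes strides of 32, 16 and 8, three anchors per cell and 20 classes, matching the code later in this post; the function names are illustrative.

```python
def head_shapes(input_size=416, num_classes=20, anchors_per_cell=3, strides=(32, 16, 8)):
    """(channels, height, width) of each prediction map."""
    channels = anchors_per_cell * (5 + num_classes)  # confidence + 4 box values + classes
    return [(channels, input_size // s, input_size // s) for s in strides]


def total_boxes(input_size=416, anchors_per_cell=3, strides=(32, 16, 8)):
    """Number of boxes predicted over all three scales."""
    return sum(anchors_per_cell * (input_size // s) ** 2 for s in strides)


print(head_shapes())     # one entry each for large, medium and small objects
print(total_boxes(416))
print(total_boxes(256))
```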

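3. Reimplementation

Study materials:
YOLOv3 hardcore walkthrough (Part 1: YOLO-V3 principles)
YOLOv3 hardcore walkthrough (Part 2: YOLO-V3 network structure + code)
YOLOv3 hardcore walkthrough (Part 3: YOLO-V3 dataset preparation + code)
YOLOv3 hardcore walkthrough (Part 4: YOLO-V3 training and prediction + code)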

Yolo.py

from torch import nn
from torch.nn import functional as Fun
import torch


# Basic convolution block (Conv2d + BatchNorm + LeakyReLU), reused by every layer below
class ConvolutionalLayer(nn.Module):
    def __init__(self, input_channels, output_channels, kernel_size, stride, padding, bias=False):
        super(ConvolutionalLayer, self).__init__()
        self.sub_module = nn.Sequential(
            # stride is the convolution step; bias is off by default because the
            # BatchNorm that follows has its own learnable shift
            nn.Conv2d(input_channels, output_channels, kernel_size, stride, padding, bias=bias),
            nn.BatchNorm2d(output_channels),  # batch normalization
            nn.LeakyReLU()  # like ReLU, but negative inputs keep a small slope (default 0.01)
        )

    def forward(self, x):
        return self.sub_module(x)


# Residual block
class ResidualLayer(nn.Module):
    def __init__(self, input_channels, output_channels):
        super(ResidualLayer, self).__init__()
        self.sub_module = nn.Sequential(
            # Example: a 208x208 input with 64 channels first goes through a 1x1 conv
            # down to 32 channels (208x208x32), then a 3x3 conv with padding 1 back up
            # to 64 channels (208x208x64), so the result can be added to the input
            ConvolutionalLayer(input_channels, output_channels, 1, 1, 0),
            ConvolutionalLayer(output_channels, input_channels, 3, 1, 1)
        )

    def forward(self, x):
        return self.sub_module(x) + x  # residual connection: add the input back


# Convolution block used in the feature-pyramid part
class ConvolutionalSetLayer(nn.Module):
    def __init__(self, input_channels, output_channels):
        super(ConvolutionalSetLayer, self).__init__()
        self.sub_module = nn.Sequential(
            # Five convolutions, alternating 1x1 and 3x3
            ConvolutionalLayer(input_channels, output_channels, 1, 1, 0),
            ConvolutionalLayer(output_channels, input_channels, 3, 1, 1),
            ConvolutionalLayer(input_channels, output_channels, 1, 1, 0),
            ConvolutionalLayer(output_channels, input_channels, 3, 1, 1),
            ConvolutionalLayer(input_channels, output_channels, 1, 1, 0)
        )

    def forward(self, x):
        return self.sub_module(x)


# Downsampling by a factor of 2, implemented with a stride-2 convolution
class DownSamplingLayer(nn.Module):
    def __init__(self, input_channels, output_channels):
        super(DownSamplingLayer, self).__init__()
        self.sub_module = nn.Sequential(
            ConvolutionalLayer(input_channels, output_channels, 3, 2, 1)
        )

    def forward(self, x):
        return self.sub_module(x)


# Upsampling by interpolation
class UpSamplingLayer(nn.Module):
    def __init__(self):
        super(UpSamplingLayer, self).__init__()

    def forward(self, x):
        # Nearest-neighbor interpolation; scale_factor is the size multiplier, mode the method
        return Fun.interpolate(x, scale_factor=2, mode='nearest')


# Build the whole network
# Downsampling 1) shrinks the feature map and 2) increases the channel count
# The residual blocks do the actual feature extraction
class yolo3_net(nn.Module):
    def __init__(self):
        super(yolo3_net, self).__init__()
        # Backbone stage that outputs the 52x52 feature map (for a 416x416 input)
        self.trunk_52 = nn.Sequential(
            # First two layers, before any residual block
            ConvolutionalLayer(3, 32, 3, 1, 1),
            DownSamplingLayer(32, 64),

            # Residual stage 1: 1 block
            ResidualLayer(64, 32),

            # Downsample
            DownSamplingLayer(64, 128),

            # Residual stage 2: 2 blocks
            *[ResidualLayer(128, 64) for _ in range(2)],
            # Downsample
            DownSamplingLayer(128, 256),

            # Residual stage 3: 8 blocks
            *[ResidualLayer(256, 128) for _ in range(8)],
        )

        # Backbone stage that outputs the 26x26 feature map
        self.trunk_26 = nn.Sequential(
            # Downsample
            DownSamplingLayer(256, 512),

            # Residual stage 4: 8 blocks
            *[ResidualLayer(512, 256) for _ in range(8)],
        )

        # Backbone stage that outputs the 13x13 feature map
        self.trunk_13 = nn.Sequential(
            # Downsample
            DownSamplingLayer(512, 1024),

            # Residual stage 5: 4 blocks
            *[ResidualLayer(1024, 512) for _ in range(4)],
        )

        # 13x13 detection branch
        self.convset_13 = nn.Sequential(
            ConvolutionalSetLayer(1024, 512)
        )

        self.detection_13 = nn.Sequential(
            # The Convolutional Set outputs 512 channels; a 3x3 conv expands them
            # back to 1024 before prediction
            ConvolutionalLayer(512, 1024, 3, 1, 1),
            # 3 anchors, 20 classes: (confidence, x/y/w/h offsets, one-hot class scores),
            # so 3 x (5 + 20) = 75 output channels
            nn.Conv2d(1024, 75, 1, 1, 0)
        )

        # Upsampling from 13x13 to 26x26
        self.up_13to26 = nn.Sequential(
            ConvolutionalLayer(512, 256, 1, 1, 0),
            UpSamplingLayer()
        )

        # 26x26 detection branch
        self.convset_26 = nn.Sequential(
            ConvolutionalSetLayer(768, 256)  # 768 = 256 (upsampled) + 512 (backbone)
        )

        self.detection_26 = nn.Sequential(
            # The Convolutional Set outputs 256 channels; a 3x3 conv expands them
            # back to 512 before prediction
            ConvolutionalLayer(256, 512, 3, 1, 1),
            # 3 x (5 + 20) = 75 output channels, as above
            nn.Conv2d(512, 75, 1, 1, 0)
        )

        # Upsampling from 26x26 to 52x52
        self.up_26to52 = nn.Sequential(
            ConvolutionalLayer(256, 128, 1, 1, 0),
            UpSamplingLayer()
        )

        # 52x52 detection branch
        self.convset_52 = nn.Sequential(
            ConvolutionalSetLayer(384, 128)  # 384 = 128 (upsampled) + 256 (backbone)
        )

        self.detection_52 = nn.Sequential(
            ConvolutionalLayer(128, 256, 3, 1, 1),
            nn.Conv2d(256, 75, 1, 1, 0)
        )

    def forward(self, x):
        # Output of each backbone stage
        h_52 = self.trunk_52(x)
        h_26 = self.trunk_26(h_52)
        h_13 = self.trunk_13(h_26)

        # 13x13: detect, then upsample and concatenate
        convset_13_out = self.convset_13(h_13)
        detection_13_out = self.detection_13(convset_13_out)
        up_13to26_out = self.up_13to26(convset_13_out)
        # dim=0 is the batch, dim=1 the channels, dim=2 the rows, dim=3 the columns
        cat_26 = torch.cat((up_13to26_out, h_26), dim=1)

        # 26x26: detect, then upsample and concatenate
        convset_26_out = self.convset_26(cat_26)
        detection_26_out = self.detection_26(convset_26_out)
        up_26to52_out = self.up_26to52(convset_26_out)
        cat_52 = torch.cat((up_26to52_out, h_52), dim=1)

        # 52x52: detect
        convset_52_out = self.convset_52(cat_52)
        detection_52_out = self.detection_52(convset_52_out)
        return detection_13_out, detection_26_out, detection_52_out


if __name__ == '__main__':
    net = yolo3_net()
    x = torch.randn(1, 3, 416, 416)
    a, b, c = net(x)
    print(a.shape, b.shape, c.shape)
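The concatenations in Yolo.py only work when each upsampled map has exactly the same height and width as the backbone map it is joined with. The sketch below traces the spatial sizes through the five stride-2 convolutions to show why input sizes that are multiples of 32 are the safe choice. It uses the standard output-size formula for a 3x3 convolution with stride 2 and padding 1; the function names are invented for this illustration.

```python
def conv_s2(n):
    """Output size of a 3x3 convolution with stride 2 and padding 1."""
    return (n + 2 * 1 - 3) // 2 + 1


def trace_sizes(input_size):
    """Spatial sizes of the three backbone outputs (h_52, h_26, h_13 in Yolo.py)."""
    sizes = [input_size]
    for _ in range(5):  # the network contains five downsampling layers
        sizes.append(conv_s2(sizes[-1]))
    return sizes[3], sizes[4], sizes[5]


def concat_is_valid(input_size):
    """True when 2x upsampling lines up with the shallower feature maps."""
    s52, s26, s13 = trace_sizes(input_size)
    return 2 * s13 == s26 and 2 * s26 == s52


print(trace_sizes(416), concat_is_valid(416))
print(trace_sizes(420), concat_is_valid(420))
```

With a 420x420 input the deepest map is 14x14, which upsamples to 28x28 and no longer matches the 27x27 backbone map, so `torch.cat` would fail.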

dataset.py

import math
import os

import torch

from config import *
import numpy as np
from torch.utils.data import Dataset
from PIL import Image
from util import *
from torchvision import transforms



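# One-hot encoding: a vector of n zeros for n classes, with a 1 at the index of the current class
# Arguments: number of classes, index of the current class
def one_hot(cls_num, i):
    rst = np.zeros(cls_num)
    rst[i] = 1
    return rst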

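# Each sample yields three label arrays of shape (feature_size, feature_size, 3, 1+4+20)
# for feature_size 13, 26 and 52, plus the image tensor
class YoloDataSet(Dataset):
    def __init__(self):
        # Read all annotation lines up front and close the file afterwards
        with open('data.txt', 'r') as f:
            self.dataset = f.readlines()

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # .strip() removes leading and trailing whitespace
        # data: 2.jpg 1 100 106 200 213 1 118 141 236 283
        data = self.dataset[index].strip()
        temp_data = data.split()
        # ----------------
        # Build the array of ground-truth boxes
        # ----------------

        # temp_data[1:] holds every (cls, x, y, w, h) 5-tuple after the file name
        # Convert them to floats first, since the network works in float,
        # then split them into groups of 5
        _boxes = np.array([float(x) for x in temp_data[1:]])
        boxes = np.split(_boxes, len(_boxes) // 5)
        # ----------------
        # Process the image the boxes belong to
        # ----------------

        # Pad the m*n image to an l*l square, where l = max(m, n)
        img = Image.open(os.path.join('data/images', temp_data[0]))
        img = make_416_image(img)
        side = img.size[0]  # the padded image is square, so width == height
        # Scale factor between the padded image and the resized network input
        case = DATA_WIDTH / side
        img = img.resize((DATA_WIDTH, DATA_HEIGHT))
        # Convert the PIL image to a tensor
        img_data = tf(img)

        labels = {}
        # Build, for each scale, the feature map of ground-truth offsets relative
        # to the anchors; it is used to compute the loss
        # feature_size is the size of the feature map: 13, 26 or 52
        # _anchors holds the [w, h] of the 3 anchors at that scale
        for feature_size, _anchors in anchors.items():
            # e.g. 13*13*3*(5+20)
            labels[feature_size] = np.zeros(shape=(feature_size, feature_size, 3, 5 + CLASS_NUM))
            # Loop over every ground-truth box in the image
            for box in boxes:
                # class, center x, center y, width, height in the original image
                cls, cx, cy, w, h = box
                # Multiply by the scale factor to get the box in the resized image
                cx, cy, w, h = cx * case, cy * case, w * case, h * case

                # Locate the box center in grid units, e.g. (1.1, 1.2) falls in cell (1, 1)
                # math.modf returns the (fractional, integer) parts of a number
                # DATA_WIDTH / feature_size is the size of one cell, so
                # cx / (DATA_WIDTH / feature_size) = cx * feature_size / DATA_WIDTH
                # _x is the fractional part (offset of the center inside its cell);
                # x_index is the integer part (which cell the center falls in)
                _x, x_index = math.modf(cx * feature_size / DATA_WIDTH)
                _y, y_index = math.modf(cy * feature_size / DATA_HEIGHT)
                # Loop over all anchors (prior boxes) of this scale
                for i, anchor in enumerate(_anchors):
                    # An area ratio (smaller area / larger area) can approximate IoU here.
                    # True IoU is intersection / union; the ratio only equals it when one
                    # box lies inside the other. The line is left disabled, so all 3 anchors
                    # of the cell are labeled positive with confidence 1:
                    # iou = min(w * h, ANCHORS_AREA[feature_size][i]) / max(w * h, ANCHORS_AREA[feature_size][i])
                    # Width/height offsets relative to the anchor, since w / anchor_w = e^offset
                    p_w, p_h = np.log(w / anchor[0]), np.log(h / anchor[1])
                    labels[feature_size][int(y_index), int(x_index), i] = np.array(
                        [1, _x, _y, p_w, p_h, *one_hot(CLASS_NUM, int(cls))])
        return labels[13], labels[26], labels[52], img_data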

if __name__ == '__main__':
    dataset=YoloDataSet()
    print(dataset[0][3].shape)
    print(dataset[0][0].shape)
    print(dataset[0][1].shape)
    print(dataset[0][2].shape)
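The label encoding in dataset.py can be checked by hand for one box. The sketch below repeats the same arithmetic with the standard library only; `encode_box` is a name made up for this example, and the box is assumed to be already scaled to the 416x416 input.

```python
import math


def encode_box(cx, cy, w, h, anchor_w, anchor_h, feature_size, input_size=416):
    """Return (row, col, x offset, y offset, log w ratio, log h ratio) for one box."""
    fx, col = math.modf(cx * feature_size / input_size)
    fy, row = math.modf(cy * feature_size / input_size)
    return int(row), int(col), fx, fy, math.log(w / anchor_w), math.log(h / anchor_h)


# A box centered at (208, 104) with exactly the size of the anchor [168, 302]
# on the 13x13 grid: 208 * 13 / 416 = 6.5 and 104 * 13 / 416 = 3.25
print(encode_box(208, 104, 168, 302, 168, 302, 13))
```

The center lands in row 3, column 6, with offsets 0.5 and 0.25 inside the cell, and both size offsets are 0 because the box has the anchor's width and height.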

util.py

from PIL import Image
import torch
from torchvision import transforms


tf=transforms.Compose([
    transforms.ToTensor()
])

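# Pad an m*n image to an l*l square, where l = max(m, n)
# Create an l*l black canvas and paste the original image onto it
# The image goes in the top-left corner because image coordinates start there,
# so the box coordinates stay valid after padding
# Despite the name, this function only pads; the caller resizes to 416x416
def make_416_image(img):
    w, h = img.size[0], img.size[1]
    temp = max(w, h)
    mask = Image.new(mode='RGB', size=(temp, temp), color=(0, 0, 0))
    mask.paste(img, (0, 0))
    return mask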

# IoU between one box and a set of other boxes
# box = [x1, y1, x2, y2], boxes = [box1, box2, ...] as an (N, 4) tensor
# ismin: divide by the smaller of the two areas instead of by the union
def iou(box, boxes, ismin=True):
    # Area of the single box
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    # Areas of the other boxes, vectorized (a for loop would also work)
    boxes_area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Corners of the intersection rectangle
    xx1 = torch.maximum(box[0], boxes[:, 0])
    yy1 = torch.maximum(box[1], boxes[:, 1])
    xx2 = torch.minimum(box[2], boxes[:, 2])
    yy2 = torch.minimum(box[3], boxes[:, 3])
    # Width and height of the intersection, clamped at 0 because the boxes may not overlap
    w, h = torch.clamp(xx2 - xx1, min=0), torch.clamp(yy2 - yy1, min=0)
    over_area = w * h

    if ismin:
        return over_area / torch.minimum(box_area, boxes_area)
    return over_area / (box_area + boxes_area - over_area)

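# Each row of boxes is [score, x1, y1, x2, y2]
def nms(boxes, threshold=0.3):
    # Nothing to suppress for an empty input
    if len(boxes) == 0:
        return boxes
    # Sort by score, descending
    new_boxes = boxes[boxes[:, 0].argsort(descending=True)]
    # Boxes to keep; the top-scoring box is always kept first
    keep_boxes = []
    while len(new_boxes) > 1:
        _box = new_boxes[0]
        _boxes = new_boxes[1:]
        keep_boxes.append(_box)

        # Compare the top-scoring box with all remaining boxes and keep only those whose
        # overlap is below the threshold. Column 0 is the score, so the coordinates
        # passed to iou are columns 1..4
        # torch.where returns the indices that satisfy the condition
        new_boxes = _boxes[torch.where(iou(_box[1:5], _boxes[:, 1:5]) < threshold)]
    # At most one box is left when the loop ends
    if len(new_boxes) == 1:
        keep_boxes.append(new_boxes[0])
    # torch.stack turns [tensor[], tensor[]] into tensor[[], []]
    return torch.stack(keep_boxes)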


if __name__ == '__main__':
    pass
    # mask = make_416_image(Image.open('data/images/000261.jpg'))
    # mask.show()
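The `ismin` flag in util.py changes which boxes NMS suppresses. The pure-Python sketch below mirrors the two overlap measures so the difference is visible on a tiny example; it is an illustration with tuple boxes of the form `(score, x1, y1, x2, y2)`, not a replacement for the tensor version above.

```python
def overlap(a, b, ismin=True):
    """Overlap of two (x1, y1, x2, y2) boxes: intersection / min area, or true IoU."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    if ismin:
        return inter / min(area_a, area_b)
    return inter / (area_a + area_b - inter)


def nms(boxes, threshold=0.3, ismin=True):
    """Greedy NMS over (score, x1, y1, x2, y2) tuples."""
    remaining = sorted(boxes, key=lambda b: b[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if overlap(best[1:5], b[1:5], ismin) < threshold]
    return kept


big = (0.9, 0, 0, 10, 10)
small = (0.8, 2, 2, 4, 4)    # lies completely inside big
far = (0.7, 20, 20, 30, 30)
print(nms([big, small, far], ismin=True))
print(nms([big, small, far], ismin=False))
```

Dividing by the smaller area gives the nested box an overlap of 1.0, so it is suppressed; with true IoU its overlap is only 0.04 and it survives.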

make_data_txt.py

import math
# xml.etree.cElementTree was removed in Python 3.9; ElementTree is the drop-in replacement
import xml.etree.ElementTree as et
import os
from config import class_num


# class_num = {
#     'mask': 0,
#     'face': 1
# }

xml_dir = 'data/image_voc'
# os.listdir returns the names of all files in the directory
xml_filenames = os.listdir(xml_dir)
# Write one line per image to data.txt: image_name cls1 x y w h cls2 x y w h ...
# Mode 'a' appends, so delete data.txt before re-running to avoid duplicate lines
with open('data.txt', 'a') as f:
    for xml_filename in xml_filenames:
        xml_filename_path = os.path.join(xml_dir, xml_filename)
        tree = et.parse(xml_filename_path)
        root = tree.getroot()
        filename = root.find('filename')
        names = root.findall('object/name')
        boxes = root.findall('object/bndbox')

        data = [filename.text]
        for name, box in zip(names, boxes):
            # Class index
            cls = class_num[name.text]
            # bndbox children are read by position: xmin, ymin, xmax, ymax
            x1, y1, x2, y2 = (int(box[k].text) for k in range(4))
            # Center and size of the ground-truth box; the center is the midpoint
            # of the two corners, (x1 + x2) / 2, not half the width
            cx, cy, w, h = math.floor((x1 + x2) / 2), math.floor((y1 + y2) / 2), x2 - x1, y2 - y1
            data.extend([cls, cx, cy, w, h])
        f.write(' '.join(str(i) for i in data) + '\n')
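The conversion from VOC corner coordinates to the center format stored in data.txt is easy to get wrong, so here is a tiny worked example. The function names are hypothetical, and the second function is the inverse used when drawing boxes.

```python
def corners_to_center(xmin, ymin, xmax, ymax):
    """(xmin, ymin, xmax, ymax) -> (cx, cy, w, h) with integer centers."""
    return (xmin + xmax) // 2, (ymin + ymax) // 2, xmax - xmin, ymax - ymin


def center_to_corners(cx, cy, w, h):
    """(cx, cy, w, h) -> (x1, y1, x2, y2)."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


box = corners_to_center(100, 50, 300, 150)
print(box)
print(center_to_corners(*box))
```

For this box, half the width would be 100, while the actual center x is 200; the two only coincide when the box starts at the image origin.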

config.py


DATA_WIDTH=416
DATA_HEIGHT=416

CLASS_NUM=20
BATCH_SIZE=6


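# anchors stores the [w, h] of the 3 anchors for each scale; only width and height matter, position does not
# They are obtained with k-means clustering
# Algorithm outline: pick 9 random centers, compute the distance from every box to these 9 centers,
#         and assign each box to its nearest center. Once every box is assigned they form 9 clusters;
#         compute a better center for each cluster and repeat until the 9 centers stop changing
#         The distance is not the usual Euclidean distance but 1 - IoU: a larger IoU means two boxes
#         are more alike, so 1 - IoU can serve as their distance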
anchors={
    13: [[168,302], [57,221], [336,284]],
    26: [[175,225], [279,160], [249,271]],
    52: [[129,209], [85,413], [44,42]]
}

ANCHORS_AREA={
    13: [x*y for x,y in anchors[13]],
    26: [x*y for x,y in anchors[26]],
    52: [x*y for x,y in anchors[52]]
}

num_class = {
    0: 'aeroplane',
    1: 'bicycle',
    2: 'bird',
    3: 'boat',
    4: 'bottle',
    5: 'bus',
    6: 'car',
    7: 'cat',
    8: 'chair',
    9: 'cow',
    10: 'diningtable',
    11: 'dog',
    12: 'horse',
    13: 'motorbike',
    14: 'person',
    15: 'pottedplant',
    16: 'sheep',
    17: 'sofa',
    18: 'train',
    19: 'tvmonitor'
}

class_num = {
    'aeroplane': 0,
    'bicycle': 1,
    'bird': 2,
    'boat': 3,
    'bottle': 4,
    'bus': 5,
    'car': 6,
    'cat': 7,
    'chair': 8,
    'cow': 9,
    'diningtable': 10,
    'dog': 11,
    'horse': 12,
    'motorbike': 13,
    'person': 14,
    'pottedplant': 15,
    'sheep': 16,
    'sofa': 17,
    'train': 18,
    'tvmonitor': 19
}
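The k-means procedure described in the comments of config.py can be sketched as follows. This is a minimal illustration, assuming boxes are `(w, h)` pairs compared as if they shared a top-left corner, and using the cluster mean as the new center; the function names, the `seed` parameter and the toy data are all made up for the example, and it is not the script that produced the anchor values above.

```python
import random


def wh_iou(a, b):
    """IoU of two (w, h) boxes aligned at the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU; return centers sorted by area."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = min(range(k), key=lambda j: 1 - wh_iou(box, centers[j]))
            clusters[nearest].append(box)
        new_centers = []
        for j, members in enumerate(clusters):
            if not members:  # keep an empty cluster's center unchanged
                new_centers.append(centers[j])
                continue
            mean_w = sum(m[0] for m in members) / len(members)
            mean_h = sum(m[1] for m in members) / len(members)
            new_centers.append((mean_w, mean_h))
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


toy_boxes = [(10, 10), (12, 12), (11, 11), (100, 100), (110, 110), (105, 105)]
print(kmeans_anchors(toy_boxes, k=2))
```

On real data the input would be the width and height of every ground-truth box, with k = 9 and the sorted centers split into three groups of three, one group per scale.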

trainer.py

import os

from torch import nn, optim
import torch
from torch.utils.data import DataLoader
from dataset import *
# from yolo_v3_net import *
from torch.utils.tensorboard import SummaryWriter
from Yolo import *


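# output is the raw network prediction with shape (batch_size, 3*(1+4+20), h, w);
# it is rearranged to (batch_size, h, w, 3, 1+4+20)
# This matches the layout produced by the dataset and makes the last axis easy to slice
# target is the ground truth from the dataset; once batched its shape is (batch_size, h, w, 3, 1+4+20)
# Positive and negative samples are imbalanced, so c is used to balance the loss terms
def loss_fun(output, target, c):
    # (batch_size, 3*(1+4+20), h, w) -> (batch_size, h, w, 3*(1+4+20))
    output = output.permute(0, 2, 3, 1)
    # (batch_size, h, w, 3*(1+4+20)) -> (batch_size, h, w, 3, 1+4+20); -1 lets PyTorch infer the size
    output = output.reshape(output.size(0), output.size(1), output.size(2), 3, -1)

    # The first element of the last axis of target is the confidence; it splits the
    # samples into positive and negative. Each mask is a boolean tensor
    mask_obj = target[..., 0] > 0
    mask_no_obj = target[..., 0] == 0  # not used below; kept for reference

    # The confidence loss is computed over all samples
    # Confidence separates two classes (object / no object), so binary cross-entropy applies
    loss_p_fun = nn.BCELoss()
    # The raw output is squashed with a sigmoid because confidence must lie in [0, 1];
    # the labels already do
    loss_p = loss_p_fun(torch.sigmoid(output[..., 0]), target[..., 0])

    # Box loss: mean squared error
    loss_box_fun = nn.MSELoss()
    # Positive samples only; indices 1..4 of the last axis (1+4+20) are the x, y, w, h offsets
    loss_box = loss_box_fun(output[mask_obj][..., 1:5], target[mask_obj][..., 1:5])

    # Class loss: multi-class (softmax) cross-entropy. Note that this differs from the
    # per-class binary cross-entropy described in the paper summary
    loss_segment_fun = nn.CrossEntropyLoss()
    # Indices 5 and up of the last axis (1+4+20) are the class scores
    # The labels were stored one-hot, but nn.CrossEntropyLoss expects class indices,
    # so argmax converts them back
    loss_segment = loss_segment_fun(output[mask_obj][..., 5:],
                                    torch.argmax(target[mask_obj][..., 5:], dim=1))
    # Because positive and negative samples are imbalanced, the terms are weighted by hand
    # so that the confidence loss carries more weight
    loss = c * loss_p + (1 - c) * 0.5 * loss_box + (1 - c) * 0.5 * loss_segment
    return loss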


if __name__ == '__main__':
    summary_writer = SummaryWriter('logs')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    dataset = YoloDataSet()
    # Shuffle so that each epoch sees the samples in a different order
    data_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

    weight_path = 'params/net.pt'
    # Checkpoints are written to params/, so make sure the directory exists
    os.makedirs('params', exist_ok=True)
    net = yolo3_net().to(device)
    # To resume training, copy a saved checkpoint to params/net.pt
    if os.path.exists(weight_path):
        net.load_state_dict(torch.load(weight_path, map_location=torch.device('cpu')))
        print("Weights loaded successfully!")

    opt = optim.Adam(net.parameters())

    epoch = 500
    total_train_step = 0
    for i in range(epoch):
        for target_13, target_26, target_52, img_data in data_loader:
            target_13, target_26, target_52, img_data = target_13.to(device), target_26.to(device), target_52.to(
                device), img_data.to(device)

            output_13, output_26, output_52 = net(img_data)

            loss_13 = loss_fun(output_13.float(), target_13.float(), 0.7)
            loss_26 = loss_fun(output_26.float(), target_26.float(), 0.7)
            loss_52 = loss_fun(output_52.float(), target_52.float(), 0.7)

            loss = loss_13 + loss_26 + loss_52
            # Clear the gradients, backpropagate, update the parameters
            opt.zero_grad()
            loss.backward()
            opt.step()
            total_train_step += 1
            if total_train_step % 100 == 0:
                print("Epoch {}, training step {}, loss {}".format(i + 1, total_train_step, loss.item()))
            summary_writer.add_scalar('train_loss', loss.item(), total_train_step)
        if (i + 1) % 100 == 0:
            torch.save(net.state_dict(), "params/net_{}_path.pt".format(i + 1), _use_new_zipfile_serialization=False)
            print('Model saved')
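Two details of `loss_fun` are easy to verify by hand: how the 75 output channels map onto (anchor, field) after the permute and reshape, and how the weight `c` splits the total loss. The sketch below works through both with the standard library only; the function names are invented for this illustration.

```python
def channel_to_field(k, num_classes=20, anchors_per_cell=3):
    """Map a channel index of the raw output to (anchor index, field index).

    Field 0 is the confidence, fields 1..4 are x, y, w, h, fields 5.. are class scores.
    """
    per_anchor = 5 + num_classes
    if not 0 <= k < anchors_per_cell * per_anchor:
        raise ValueError("channel index out of range")
    return divmod(k, per_anchor)


def weighted_loss(loss_conf, loss_box, loss_cls, c):
    """Same weighting as loss_fun: c for confidence, the rest split evenly."""
    return c * loss_conf + (1 - c) * 0.5 * loss_box + (1 - c) * 0.5 * loss_cls


print(channel_to_field(0))    # confidence of anchor 0
print(channel_to_field(30))   # first class score of anchor 1
print(weighted_loss(1.0, 2.0, 4.0, c=0.7))
```

With c = 0.7 the confidence term gets weight 0.7 and the box and class terms get 0.15 each.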

detector.py

import os

import torch
from torch import nn
from Yolo import *
from PIL import Image, ImageDraw
from util import *
from dataset import *
from config import *


# class_num = {
#     0: 'mask',
#     1: 'face'
# }


class Detector(nn.Module):
    def __init__(self):
        super(Detector, self).__init__()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.weights = 'params/net.pt'
        self.net = yolo3_net().to(self.device)
        if os.path.exists(self.weights):
            self.net.load_state_dict(torch.load(self.weights, map_location=torch.device('cpu')))
            print('Weights loaded successfully')
        # eval() switches the network to inference mode: dropout is disabled and
        # BatchNorm uses its running statistics
        # Calling net.train() is optional during training, but eval() is required for testing
        self.net.eval()

    # input: image tensor, thresh: confidence threshold, anchors: prior boxes, case: scale factor
    def forward(self, input, thresh, anchors, case):
        output_13, output_26, output_52 = self.net(input)
        index_13, bias_13 = self.get_index_and_bias(output_13, thresh)
        boxes_13 = self.get_true_position(index_13, bias_13, 32, anchors[13], case)

        index_26, bias_26 = self.get_index_and_bias(output_26, thresh)
        boxes_26 = self.get_true_position(index_26, bias_26, 16, anchors[26], case)

        index_52, bias_52 = self.get_index_and_bias(output_52, thresh)
        boxes_52 = self.get_true_position(index_52, bias_52, 8, anchors[52], case)

        # Concatenate the boxes of all three scales along dimension 0
        return torch.cat([boxes_13, boxes_26, boxes_52], dim=0)

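    # Filter the network output by confidence, then return the indices and raw
    # predictions of the boxes that pass
    # output: network output, thresh: confidence threshold
    def get_index_and_bias(self, output, thresh):
        # output has shape (batch_size, 3*(1+4+20), h, w); rearrange it to (batch_size, h, w, 3, 1+4+20)
        output = output.permute(0, 2, 3, 1)
        output = output.reshape(output.size(0), output.size(1), output.size(2), 3, -1)

        # (see test.py)
        # Keep the predictions whose confidence passes the threshold. The sigmoid is applied
        # first so that thresh is compared with a probability, as in training
        mask = torch.sigmoid(output[..., 0]) >= thresh
        # nonzero returns the indices of the True elements:
        # index has shape (K, 4) with columns (batch, row, col, anchor)
        index = mask.nonzero()
        # Raw predictions of the K selected boxes, shape (K, 1+4+20)
        bias = output[mask]

        return index, bias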

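    # Convert the selected indices and offsets to positions in the original image
    # index: which cell each box is in, bias: raw predictions (offsets inside the cell),
    # t: size of one cell in the 416*416 input, anchors: priors of this scale, case: scale factor
    def get_true_position(self, index, bias, t, anchors, case):
        anchors = torch.Tensor(anchors).to(self.device)
        # index has columns (batch, row, col, anchor)
        # Which of the 3 anchors each box uses
        a = index[:, 3]

        # (col + x offset, row + y offset) is the position in grid units; multiplying by
        # the cell size gives coordinates in the 416*416 input,
        # and dividing by the scale factor gives coordinates in the original image
        cy = (index[:, 1].float() + bias[:, 2].float()) * t / case
        cx = (index[:, 2].float() + bias[:, 1].float()) * t / case

        # The anchor's w, h times e^(predicted w, h) gives w, h in the 416*416 input;
        # dividing by the scale factor gives w, h in the original image
        w = anchors[a, 0] * torch.exp(bias[:, 3]) / case
        h = anchors[a, 1] * torch.exp(bias[:, 4]) / case

        # Confidence (raw logit; the sigmoid is applied in the return statement)
        p = bias[:, 0]
        # Per-class scores
        cls_p = bias[:, 5:]
        # Predicted class of each box
        cls_index = torch.argmax(cls_p, dim=1)

        # cls_index is cast to float so that all stacked columns share one dtype
        return torch.stack([torch.sigmoid(p), cx, cy, w, h, cls_index.float()], dim=1)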


if __name__ == '__main__':
    detector = Detector()
    detector_img = 'images/test_img/000010.jpg'
    img = Image.open(detector_img)
    _img = make_416_image(img)
    temp = max(_img.size)
    # Scale factor
    case = 416 / temp
    _img = _img.resize((416, 416))
    _img = tf(_img).to(detector.device)
    # The network expects a 4-D input and the image is 3-D, so add a batch dimension
    _img = torch.unsqueeze(_img, dim=0)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        results = detector(_img, 0.5, anchors, case)
    draw = ImageDraw.Draw(img)

    # Each row is [confidence, cx, cy, w, h, class index]; tolist() gives plain floats for PIL
    for rst in results.tolist():
        # Top-left and bottom-right corners from the center and size
        x1, y1, x2, y2 = rst[1] - 0.5 * rst[3], rst[2] - 0.5 * rst[4], rst[1] + 0.5 * rst[3], rst[2] + 0.5 * rst[4]
        print(x1, y1, x2, y2)
        print('class', num_class[int(rst[5])])
        draw.text((x1, y1), num_class[int(rst[5])] + str(rst[0])[:4])
        draw.rectangle((x1, y1, x2, y2), outline='red', width=1)
    img.show()
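The decoding in `get_true_position` is the inverse of the label encoding in dataset.py. The sketch below repeats the arithmetic for a single prediction with the standard library only; `decode_box` is a hypothetical name, and the x/y offsets are used as raw values, as in the code above.

```python
import math


def decode_box(row, col, tx, ty, tw, th, anchor_w, anchor_h, stride, case):
    """Turn one grid prediction into (cx, cy, w, h) in original-image pixels."""
    cx = (col + tx) * stride / case
    cy = (row + ty) * stride / case
    w = anchor_w * math.exp(tw) / case
    h = anchor_h * math.exp(th) / case
    return cx, cy, w, h


# Cell (row 3, col 6) on the 13x13 grid (stride 32), offsets (0.5, 0.25),
# size offsets 0, anchor [168, 302], image already 416 pixels on its long side
print(decode_box(3, 6, 0.5, 0.25, 0.0, 0.0, 168, 302, stride=32, case=1.0))
# The same prediction for an image whose long side is 832 pixels (case = 416 / 832)
print(decode_box(3, 6, 0.5, 0.25, 0.0, 0.0, 168, 302, stride=32, case=0.5))
```

The first call recovers a box centered at (208, 104) with the anchor's size; in the second, every value doubles because the original image is twice as large as the network input.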
