Yolov8学习笔记之代码解析、自定义数据制作、训练、推理。

NO郑

已于 2024-04-29 15:03:54 修改

阅读量1.6k

点赞数 46

文章标签： YOLO 计算机视觉

于 2024-04-28 10:54:10 首次发布

本文链接：https://blog.csdn.net/NOZhengYuan/article/details/138269348

版权

摘要：Yolov8学习笔记，详细内容参考目录。

主要讲解:1、模型结构、模块、计算、原理。2、自定义数据集制作、训练。4、训练。5、推理

github项目地址：YoloV8项目地址

目录：

一、网络结构代码解析

1.1 网络总结构

模型结构yaml文件路径： ./ultralytics/cfg/models/v8/yolov8.yaml

主要是每一层结构及参数，注释了每一层得输入、输出通道数以及尺寸大小

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs
  s: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPs
  m: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPs
  l: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# from(输入位置，-1表示自身所在的特征层的位置.)，repeats（叠加层数），module（模块），args（[输出通道数，卷积核大小，步距，padding]）
# 其中c2f传两个参数，一个是输出通道，一个是shortcut，默认False，决定经过c2f后是否加上输入x
# 为True得时候：x + self.cv2(self.cv1(x)) ； 为False： self.cv2(self.cv1(x))
# YOLOv8.0n backbone  计算size的公式：out_size =（ in_size - k +2 * p）/ s + 1
backbone:
  # [from, repeats, module, args]                                         in_channel：3                                 size：640*640
  - [-1, 1, Conv, [64, 3, 2]]  # 0-P1/2                                   in_channel：3      out_channel: 64            size：320*320
  - [-1, 1, Conv, [128, 3, 2]]  # 1-P2/4                                  in_channel：64     out_channel: 128           size：160*160
  - [-1, 3, C2f, [128, True]]   # 2                                 *3    in_channel：128    out_channel: 128           size：160*160
  - [-1, 1, Conv, [256, 3, 2]]  # 3-P3/8                                  in_channel：128    out_channel: 256           size：80*80
  - [-1, 6, C2f, [256, True]]   # 4                                 *6    in_channel：256    out_channel: 256           size：80*80
  - [-1, 1, Conv, [512, 3, 2]]  # 5-P4/16                                 in_channel：256    out_channel: 512           size：40*40
  - [-1, 6, C2f, [512, True]]   # 6                                 *6    in_channel：512    out_channel: 512           size：40*40
  - [-1, 1, Conv, [1024, 3, 2]]  # 7-P5/32                                in_channel：512    out_channel: 1024          size：20*20
  - [-1, 3, C2f, [1024, True]]   # 8                                *3    in_channel：1024   out_channel: 1024          size：20*20
  - [-1, 1, SPPF, [1024, 5]]  # 9                                         in_channel：1024   out_channel: 1024          size：20*20

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]   # 10                                                                   size：40*40
  - [[-1, 6], 1, Concat, [1]]  # cat backbone P4   11                     in_channel：1024   out_channel: 1536          size：40*40
  - [-1, 3, C2f, [512]]                          # 12                     in_channel：1536   out_channel: 512           size：40*40

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]   # 13                                                                   size：80*80
  - [[-1, 4], 1, Concat, [1]]  # cat backbone P3   14                     in_channel：1536   out_channel: 768           size：80*80
  - [-1, 3, C2f, [256]]  # 15 (P3/8-small)         15                     in_channel：768    out_channel: 256           size：80*80

  - [-1, 1, Conv, [256, 3, 2]]                   # 16                     in_channel：256    out_channel: 256           size：80*80
  - [[-1, 12], 1, Concat, [1]]  # cat head P4      17                     in_channel：256    out_channel: 768            size：320*320
  - [-1, 3, C2f, [512]]  # 18 (P4/16-medium)       18                     in_channel：768    out_channel: 512            size：320*320

  - [-1, 1, Conv, [512, 3, 2]]                   # 19                     in_channel：512    out_channel: 512            size：320*320
  - [[-1, 9], 1, Concat, [1]]  # cat head P5       20                     in_channel：512    out_channel: 1536           size：320*320
  - [-1, 3, C2f, [1024]]  # 21 (P5/32-large)       21                     in_channel：1536   out_channel: 1024            size：320*320

  - [[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)

1.2 Backbone

1.2.1 Conv（CBS）

Conv代码路径： ./ultralytics/nn/modules/conv.py

作用：该模块主要用于提取特征、整理特征图。主要由Conv2d、BatchNorm2d、SiLu组成。

Conv2d：在backbone中Conv模块的stride全部为2，kernel均为3。因此Conv每次会将特征图的宽高减半，下采样特征图，同时提取到目标特征。
BatchNorm2d：归一化层。
SiLu激活函数：SiLU的激活是通过sigmoid函数乘以其输入来计算的，即xσ(x)。

其中：forward训练，forward_fuse推理。

class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""

    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Perform transposed convolution of 2D data."""
        return self.act(self.conv(x))

1.2.2 C2f

C2f代码路径：./ultralytics/nn/modules/block.py

作用：增强了backbone的学习能力、消除了计算瓶颈、降低了内存成本。

c2f中spilt将卷积后的特征图（h,w,c）中的通道数c分割成两半。
Bottleneck之前的特征图（deature map）和Bottleneck之后的特征图进行concat拼接操作。这叫做残差连接。
n个Bottleneck穿行，并将每一个Bottleneck都和最后的Bottleneck拼接起来，类似于完成了特征融合。

class C2f(nn.Module):
    """Faster Implementation of CSP Bottleneck with 2 convolutions."""

    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        """Initialize CSP bottleneck layer with two convolutions with arguments ch_in, ch_out, number, shortcut, groups,
        expansion.
        """
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        """Forward pass through C2f layer."""
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        """Forward pass using split() instead of chunk()."""
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))
        
class Bottleneck(nn.Module):
    """Standard bottleneck."""

    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):
        """Initializes a bottleneck module with given input/output channels, shortcut option, group, kernels, and
        expansion.
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        """'forward()' applies the YOLO FPN to input data."""
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

1.2.3 SPPF

SPPF路径：./ultralytics/nn/modules/block.py

作用：提高模型处理在不同输入尺寸时的速度和精度。将将不同尺度的特征图转为尺度上。

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher."""

    def __init__(self, c1, c2, k=5):
        """
        Initializes the SPPF layer with given input/output channels and kernel size.

        This module is equivalent to SPP(k=(5, 9, 13)).
        """
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        """Forward pass through Ghost Convolution block."""
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))

1.4 Neck

PAN-FPN：是指的在在YOLOv8的neck部分模仿PANet里面的backbone，也就是FPN网络（feature pyramid network）进行组织。PAN-FPN这个网络的特点进行简单概括，就是先下采样，然后再上采样，上采样和下采样的这两个分支之间还有两个跨层融合连接。（也可以反过来，先上采样，再下采样）

为什么：从model的yaml文件中可知：在backbone后，通过Upsample上采样的方式，向特征图中插值，使特征图的尺度变大，以便于融合来自backbone的特征图（图像channl减少，size增加）。与在backbone相同尺寸层进行拼接，分别是6、4、12、9。使浅层的图形特征与深层的语义特征做更好的融合，而不仅仅是简单的concat。

作用：就是将浅层的图形特征和深层的语义特征相结合，以获取更为完整的特征。

head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)

1.4.1 Detect

代码路径：./ultralytics/nn/modules/head.py

检测头作用：主要负责在图像或特征图上执行目标检测的任务，将输入图像或特征图划分成多个区域，并为每个区域生成检测框和对应的类别概率。

class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""

    dynamic = False  # force grid reconstruction
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):
        """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
        )
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:  # Training path
            return x

        # Inference path
        shape = x[0].shape  # BCHW
        x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
        if self.dynamic or self.shape != shape:
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape

        if self.export and self.format in ("saved_model", "pb", "tflite", "edgetpu", "tfjs"):  # avoid TF FlexSplitV ops
            box = x_cat[:, : self.reg_max * 4]
            cls = x_cat[:, self.reg_max * 4 :]
        else:
            box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)
        dbox = self.decode_bboxes(box)

        if self.export and self.format in ("tflite", "edgetpu"):
            # Precompute normalization factor to increase numerical stability
            # See https://github.com/ultralytics/ultralytics/issues/7371
            img_h = shape[2]
            img_w = shape[3]
            img_size = torch.tensor([img_w, img_h, img_w, img_h], device=box.device).reshape(1, 4, 1)
            norm = self.strides / (self.stride[0] * img_size)
            dbox = dist2bbox(self.dfl(box) * norm, self.anchors.unsqueeze(0) * norm[:, :2], xywh=True, dim=1)

        y = torch.cat((dbox, cls.sigmoid()), 1)
        return y if self.export else (y, x)

    def bias_init(self):
        """Initialize Detect() biases, WARNING: requires stride availability."""
        m = self  # self.model[-1]  # Detect() module
        # cf = torch.bincount(torch.tensor(np.concatenate(dataset.labels, 0)[:, 0]).long(), minlength=nc) + 1
        # ncf = math.log(0.6 / (m.nc - 0.999999)) if cf is None else torch.log(cf / cf.sum())  # nominal class frequency
        for a, b, s in zip(m.cv2, m.cv3, m.stride):  # from
            a[-1].bias.data[:] = 1.0  # box
            b[-1].bias.data[: m.nc] = math.log(5 / m.nc / (640 / s) ** 2)  # cls (.01 objects, 80 classes, 640 img)

    def decode_bboxes(self, bboxes):
        """Decode bounding boxes."""
        return dist2bbox(self.dfl(bboxes), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides

二、自定义数据集制作

2.1 数据集路径

原数据路径：ultralytics/cfg/datasets

创建自定义数据的yaml文件，放在ultralytics/cfg/datasets下面

2.2 编译data的yaml文件

Path：自定义数据集路径：里面包含两个个文件：images、labels，image中放图片（包含train、test、val三个文件），labels中放标签（包含train、test、val三个文件）

train：训练集（图片训练集的路径，在image下）

val：验证集（图片测试集的路径）

test：测试集

2.2 标签制作

通过labelimg或者labelme标注图片

labelme地址

labelimg地址

2.3 标签格式转换

由于YOLO的标签是txt格式，需转换label格式为txt。

此处提供XML转TXT脚本

import os
from xml.dom.minidom import parse
from tqdm import tqdm

def set_txt(root_path, img_root_path):
    txt_root_path = root_path + "/labels"
    if not os.path.exists(txt_root_path):
        os.makedirs(txt_root_path)

    imgs_name = os.listdir(img_root_path)
    i = 0
    for img_name in imgs_name:
        i += 1
        txtname = img_name.split('.')[0] + ".txt"
        txt_path =os.path.join(txt_root_path, txtname)
        if not os.path.exists(txt_path):
            open(txt_path, 'w').close()
    print("txt numbel: ", i)

def xml_to_txt(xml_root_path, classes):
    xmls_name = os.listdir(xml_root_path)
    numbel = 0
    for xml_name in tqdm(xmls_name):
        numbel += 1
        xml_path = os.path.join(xml_root_path, xml_name)
        txt_name = xml_name[:-4] + ".txt"
        txt_root_path = os.path.join(root_path, "labels")
        if not os.path.exists(txt_root_path):
            os.makedirs(txt_root_path)
        txt_path = os.path.join(txt_root_path, txt_name)

        dom = parse(xml_path)
        root = dom.documentElement
        width = root.getElementsByTagName("width")[0].childNodes[0].nodeValue
        hidth = root.getElementsByTagName("height")[0].childNodes[0].nodeValue

        objs = root.getElementsByTagName('object')

        with open(txt_path, "w") as file:
            for obj in objs:
                label = obj.getElementsByTagName("name")[0].childNodes[0].nodeValue
                xmin = obj.getElementsByTagName("xmin")[0].childNodes[0].nodeValue
                ymin = obj.getElementsByTagName("ymin")[0].childNodes[0].nodeValue
                xmax = obj.getElementsByTagName("xmax")[0].childNodes[0].nodeValue
                ymax = obj.getElementsByTagName("ymax")[0].childNodes[0].nodeValue

                x_center = (int(xmin) + int(xmax)) / (2 * int(width))
                y_center = (int(ymin) + int(ymax)) / (2 * int(hidth))

                w = (int(xmax) - int(xmin)) / int(width)
                h = (int(ymax) - int(ymin)) / int(hidth)

                index = classes.index(label)
                # if label == 'warning_signsw':
                #     print(xml_name)
                file.write(str(index) + " " + str(x_center) + " " + str(y_center) + " " + str(w) + " " + str(h) + "\n")

    print(numbel)

if __name__ == "__main__":
    classes = ['类别1'， '类别2']
    root_path = r"根路径"                # 根路径需要包含images,val
    img_root_path = os.path.join(root_path, "images")
    xml_root_path = os.path.join(root_path, "val")

    set_txt(root_path, img_root_path)
    xml_to_txt(xml_root_path, classes)

三、训练

if __name__ == '__main__':   
    # Load a model
    model = YOLO('yolov8n.pt')  # load a pretrained model (recommended for training)

    # Train the model with 2 GPUs
    results = model.train(data='ultralytics/cfg/my_data/huangjinbu.yaml', epochs=2,             imgsz=640, device=[0], batch=4)

参数名	参数值	注释
model	None	模型文件的路径，即yolov8n.pt,yolov8n.yaml
data	None	数据文件的路径，即 coco128.yaml
epochs	100	训练的数
time	None	训练的小时数，如果已提供，则覆盖纪元数
patience	50	在没有明显改善的情况下，提前停止训练的等待时间
batch	16	每批图像数（-1 表示自动批次）
imgsz	640	输入图像的整数大小
save	True	保存列车检查点并预测结果
save_period	-1	Save checkpoint every x epochs (disabled if < 1)
cache	False	true/ram，磁盘或 False。使用缓存加载数据
device	None	设备上运行，即 cuda device=0 或 device=0,1,2,3 或 device=cpu
workers	8	加载数据的工作线程数（如果是 DDP，则为每个 RANK）
project	None	项目名称
name	None	实验名称
exist_ok	False	是否覆盖现有实验
pretrained	True	(bool或str）是使用预训练模型（bool）还是使用模型加载权重（str）
optimizer	'auto'	使用的优化器，选择=[SGD、Adam、Adamax、AdamW、NAdam、RAdam、RMSProp、auto]
verbose	False	是否打印冗余输出
seed	0	随机种子
deterministic	True	是否启用确定性模式
single_cls	False	将多类数据训练为单类数据
rect	False	矩形训练，每批都经过整理，以减少填充物
cos_lr	False	使用余弦学习率调度器
close_mosaic	10	(int)禁用最后历时的马赛克增强（0 表示禁用）。
resume	False	从上一个检查站恢复训练
amp	True	自动混合精度 (AMP) 训练，选择=[真，假］
fraction	1.0	要训练的数据集分数（默认为 1.0，训练集中的所有图像）
profile	False	记录仪训练期间的ONNX 和TensorRT 速度剖面图
freeze	None	(int 或 list，可选）在训练过程中冻结前 n 层，或冻结层索引列表
lr0	0.01	初始学习率（即 SGD=1E-2, Adam=1E-3）
lrf	0.01	最终学习率 (lr0 * lrf)
momentum	0.937	新元动量/亚当贝塔1
weight_decay	0.0005	优化器权重衰减 5e-4
warmup_epochs	3.0	预热时间（可以是分数）
warmup_momentum	0.8	热身初始动力
warmup_bias_lr	0.1	预热初始偏置
box	7.5	箱内损失收益
cls	0.5	CLS 损耗增益（按像素缩放）
dfl	1.5	DFL 损失收益
pose	12.0	姿势损失增益（仅姿势）
kobj	2.0	关键点对象损失增益（仅姿势）
label_smoothing	0.0	标签平滑（分数）
nbs	64	标称批量
overlap_mask	True	遮罩应在训练期间重叠（仅分段训练）
mask_ratio	4	掩码下采样率（仅限段列车）
dropout	0.0	使用 dropout 正则化（仅对火车进行分类）
val	True	在培训期间验证/测试
plots	False	在训练/评估过程中保存绘图和图像

四、推理

推理参数根据模型与权重选择检测、分割、关键点。

名称	类型	默认值	说明
source	str	'ultralytics/assets'	图像或视频的源目录
conf	float	0.25	检测对象置信度阈值
iou	float	0.7	NMS 的 "相交大于结合"（IoU）阈值
imgsz	int or tuple	640	图像尺寸标量或（高，宽）列表，即（640，480）
half	bool	False	使用半精度 (FP16)
device	None or str	None	设备上运行，如 cuda device=0/1/2/3 或 device=cpu
max_det	int	300	每幅图像的最大检测次数
vid_stride	bool	False	视频帧速率跨度
stream_buffer	bool	False	缓冲所有流媒体帧（真）或返回最新帧（假）
visualize	bool	False	可视化模型特征
augment	bool	False	对预测源进行图像增强
agnostic_nms	bool	False	不分等级的 NMS
classes	list[int]	None	按类别筛选结果，即 classes=0，或 classes=[0,2,3]
retina_masks	bool	False	使用高分辨率分割掩膜
embed	list[int]	None	返回给定层的特征向量/嵌入值