Keypoint detection, an important branch of computer vision, is widely used in human pose estimation, facial expression recognition, hand gesture analysis, and many other scenarios. Its core task is to accurately detect and localize specific keypoints in an image. Current keypoint detection algorithms fall into two camps: single-stage and two-stage.
We saw earlier that RTMO uses the YOLO series as its backbone, and YOLO is itself a well-known single-stage family for object detection, semantic segmentation, and keypoint detection. So I want to get familiar with YOLO first, starting from the classic ultralytics implementation. Below I look at how the ultralytics backbone extracts features: I first walk through the yolov8-pose network structure as an example, and then study how its backbone is built.
1. Analyzing the yolov8 backbone code
1.1 Printing the yolov8 backbone network structure
To make the YOLO series truly usable for your own work, you first have to get familiar with it. Here I want to print and inspect the network structure defined by the yolov8 config, and then implement it by hand, so these notes record how to inspect the config-defined network and hand-write the model.
yolov8 is widely used, and before training we usually print the model structure and study how it works. So how do we print it for yolov8?
Take yolo-pose as an example. The relevant code lives in tasks.py in the ultralytics source. Expanding the PoseModel class there, it is clear we can instantiate PoseModel directly and pass in our own model config file. I run it at the bottom of tasks.py:
if __name__ == "__main__":
yaml_path = r"D:\ultralytics\cfg\models\v8\yolov8-pose.yaml"
    # nc is the number of classes the model detects
model = PoseModel(cfg=yaml_path, nc=10)
Running it prints one row per entry in the model config, 23 rows in total. params is the parameter count of each layer, module is the layer's class name, and arguments lists the arguments passed to that layer. The final summary line gives the totals: the model has 250 layers, 3,297,225 parameters, and 9.3 GFLOPs.
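If you want to cross-check those totals yourself, a quick sketch (assuming model is the PoseModel instance built above):
n_params = sum(p.numel() for p in model.parameters())  # total parameter count
n_layers = len(list(model.modules()))                  # module count, roughly how ultralytics counts layers
print(f"{n_layers} layers, {n_params} parameters")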
The above only prints the per-layer information from the network config. If we want more detail for every step, we can call model.info() directly; the code is below. (Note this only applies to the YOLO family; other networks may not have an info attribute.)
Load a trained model, or a network structure config file:
from ultralytics import YOLO
# Load a trained model, or a network structure config file
model = YOLO('ultralytics/weights/yolov8n-pose.pt')
# model = YOLO('ultralytics/cfg/models/v8/yolov8n.yaml')
Print the model's parameter summary:
# Print the model's parameter summary
print(model.info())
The result is as follows:
Print each layer's structure information by adding the detailed argument to the call above:
print(model.info(detailed=True))
Part of the result looks like this (truncated here):
You can see it prints the name, parameter count, and shape of every layer in the model.
The same approach works for the other model structures in the ultralytics framework, and is handy for comparing parameter counts, compute cost, and so on across models.
We can also print the model object directly:
YOLO(
(model): PoseModel(
(model): Sequential(
(0): Conv(
(conv): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): C2f(
(cv1): Conv(
(conv): Conv2d(32, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(48, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(3): Conv(
(conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(4): C2f(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0-1): 2 x Bottleneck(
(cv1): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(5): Conv(
(conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(6): C2f(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0-1): 2 x Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(7): Conv(
(conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(8): C2f(
(cv1): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(9): SPPF(
(cv1): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): MaxPool2d(kernel_size=5, stride=1, padding=2, dilation=1, ceil_mode=False)
)
(10): Upsample(scale_factor=2.0, mode='nearest')
(11): Concat()
(12): C2f(
(cv1): Conv(
(conv): Conv2d(384, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(192, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(13): Upsample(scale_factor=2.0, mode='nearest')
(14): Concat()
(15): C2f(
(cv1): Conv(
(conv): Conv2d(192, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(96, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(16): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(17): Concat()
(18): C2f(
(cv1): Conv(
(conv): Conv2d(192, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(192, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(19): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(20): Concat()
(21): C2f(
(cv1): Conv(
(conv): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
)
)
)
(22): Pose(
(cv2): ModuleList(
(0): Sequential(
(0): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
)
(1): Sequential(
(0): Conv(
(conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
)
(2): Sequential(
(0): Conv(
(conv): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
)
)
(cv3): ModuleList(
(0): Sequential(
(0): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
(1): Sequential(
(0): Conv(
(conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
(2): Sequential(
(0): Conv(
(conv): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
)
(dfl): DFL(
(conv): Conv2d(16, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
(cv4): ModuleList(
(0): Sequential(
(0): Conv(
(conv): Conv2d(64, 51, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(51, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(51, 51, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(51, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(51, 51, kernel_size=(1, 1), stride=(1, 1))
)
(1): Sequential(
(0): Conv(
(conv): Conv2d(128, 51, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(51, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(51, 51, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(51, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(51, 51, kernel_size=(1, 1), stride=(1, 1))
)
(2): Sequential(
(0): Conv(
(conv): Conv2d(256, 51, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(51, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(51, 51, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(51, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
(act): SiLU(inplace=True)
)
(2): Conv2d(51, 51, kernel_size=(1, 1), stride=(1, 1))
)
)
)
)
)
)
When we call print(model), we see the model's structural description; this is the default printing behavior of PyTorch's Module class. It recursively displays the model hierarchy, including submodules and their attributes.
Printing the model this way lets you check that the structure matches your expectations, that all necessary layers are present, and that the parameters are set correctly, which is very useful for debugging and model design. If you run into problems loading weights, inspecting the printed output can help you determine whether the model structure matches the one saved in the weight file.
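For instance (a hypothetical follow-up, assuming the YOLO model loaded above), you can drill into the nested structure shown in the printout:
# The YOLO wrapper holds a PoseModel at .model, which in turn holds the Sequential at .model
stem = model.model.model[0]  # layer 0: the stem Conv block
print(stem)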
1.2 Analyzing the yolov8 yaml config file
Above we built the model in two ways: from the config yolov8-pose.yaml and from the pretrained weights yolov8n-pose.pt. Let's first study the config file; here is a yolov8-pose.yaml:
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8-pose keypoints/pose estimation model. For Usage examples see https://docs.ultralytics.com/tasks/pose
# Parameters
nc: 1 # number of classes
kpt_shape: [17, 3] # number of keypoints, number of dims (2 for x,y or 3 for x,y,visible)
scales: # model compound scaling constants, i.e. 'model=yolov8n-pose.yaml' will call yolov8-pose.yaml with scale 'n'
# [depth, width, max_channels]
n: [0.33, 0.25, 1024]
s: [0.33, 0.50, 1024]
m: [0.67, 0.75, 768]
l: [1.00, 1.00, 512]
x: [1.00, 1.25, 512]
# YOLOv8.0n backbone
backbone:
# [from, repeats, module, args]
- [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
- [-1, 3, C2f, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 9
# YOLOv8.0n head
head:
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 6], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 12
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 4], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 15 (P3/8-small)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 12], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 18 (P4/16-medium)
- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 9], 1, Concat, [1]] # cat head P5
- [-1, 3, C2f, [1024]] # 21 (P5/32-large)
- [[15, 18, 21], 1, Pose, [nc, kpt_shape]] # Pose(P3, P4, P5)
As you can see, the file contains the class count, the model scales, and the backbone and head structures. Together these settings determine the model's performance and complexity.
A brief explanation:
- nc: number of classes, i.e. the total number of object categories the model detects; 1 means this config detects a single class.
- kpt_shape: the keypoint shape. [17, 3] means 17 keypoints, each described by three values: the x, y coordinates and v (visibility).
- scales: compound scaling constants defining model variants of different size and complexity. For example, since I use yolov8n-pose.yaml here, the n entry [0.33, 0.25, 1024] is applied.
- backbone: the backbone is the foundation of the model, extracting from the input image the features on which the later detection layers build. Each entry is [from, repeats, module, args]: from is where the layer's input comes from (-1 means the previous module's output; any other value indexes a specific earlier module); repeats is how many times the module should be repeated (e.g. repeats=3 stacks a block three times); module is the type of module to add; args are the arguments passed to it.
- head: defines the detection head, the network structure that produces the final predictions.
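To make the scaling concrete, here is a small sketch of how the width constant rescales the nominal channel counts from the yaml (make_divisible is reimplemented here for illustration; ultralytics ships its own version):
import math

def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor

width, max_channels = 0.25, 1024  # the 'n' scale: [0.33, 0.25, 1024]
for nominal in (64, 128, 256, 512, 1024):
    print(nominal, "->", make_divisible(min(nominal, max_channels) * width))
# 64 -> 16, 128 -> 32, 256 -> 64, 512 -> 128, 1024 -> 256, matching the printed structure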
1.3 How yolov8 loads the backbone from the yaml file
Let's again check from the source and trace how the backbone gets loaded.
First, the code enters the yaml_model_load function.
Checking that function, the code is simple; the important part is the returned dict d, which we can print directly:
{'nc': 1,
'kpt_shape': [17, 3],
'scales': {'n': [0.33, 0.25, 1024], 's': [0.33, 0.5, 1024], 'm': [0.67, 0.75, 768], 'l': [1.0, 1.0, 512], 'x': [1.0, 1.25, 512]},
'backbone': [[-1, 1, 'Conv', [64, 3, 2]], [-1, 1, 'Conv', [128, 3, 2]], [-1, 3, 'C2f', [128, True]], [-1, 1, 'Conv', [256, 3, 2]], [-1, 6, 'C2f', [256, True]], [-1, 1, 'Conv', [512, 3, 2]], [-1, 6, 'C2f', [512, True]], [-1, 1, 'Conv', [1024, 3, 2]], [-1, 3, 'C2f', [1024, True]], [-1, 1, 'SPPF', [1024, 5]]],
'head': [[-1, 1, 'nn.Upsample', ['None', 2, 'nearest']], [[-1, 6], 1, 'Concat', [1]], [-1, 3, 'C2f', [512]], [-1, 1, 'nn.Upsample', ['None', 2, 'nearest']], [[-1, 4], 1, 'Concat', [1]], [-1, 3, 'C2f', [256]], [-1, 1, 'Conv', [256, 3, 2]], [[-1, 12], 1, 'Concat', [1]], [-1, 3, 'C2f', [512]], [-1, 1, 'Conv', [512, 3, 2]], [[-1, 9], 1, 'Concat', [1]], [-1, 3, 'C2f', [1024]], [[15, 18, 21], 1, 'Pose', ['nc', 'kpt_shape']]],
'scale': '',
'yaml_file': 'D:\\ultralytics\\ultralytics\\cfg\\models\\v8\\yolov8-pose.yaml'}
So it simply parses the yaml into a dict. If you are interested in how the parsing works, look at the yaml_load function; a simplified sketch follows.
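Its core is just PyYAML. A minimal stand-in (the real yaml_load also sanitizes non-printable characters and can record the file name, so treat this as an outline rather than the exact source):
import yaml

def yaml_load_sketch(path):
    """Minimal stand-in for ultralytics' yaml_load: read a yaml file into a dict."""
    with open(path, errors="ignore", encoding="utf-8") as f:
        return yaml.safe_load(f.read()) or {}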
Back in PoseModel: we now have the cfg contents and need to initialize the model with them. PoseModel inherits from DetectionModel, and stepping into it, we easily find where the model gets defined:
We step into the parse_model function:
def parse_model(d, ch, verbose=True): # model_dict, input_channels(3)
"""Parse a YOLO model.yaml dictionary into a PyTorch model."""
import ast
# Args
max_channels = float("inf")
nc, act, scales = (d.get(x) for x in ("nc", "activation", "scales"))
depth, width, kpt_shape = (d.get(x, 1.0) for x in ("depth_multiple", "width_multiple", "kpt_shape"))
if scales:
scale = d.get("scale")
if not scale:
scale = tuple(scales.keys())[0]
LOGGER.warning(f"WARNING ⚠️ no model scale passed. Assuming scale='{scale}'.")
depth, width, max_channels = scales[scale]
if act:
Conv.default_act = eval(act) # redefine default activation, i.e. Conv.default_act = nn.SiLU()
if verbose:
LOGGER.info(f"{colorstr('activation:')} {act}") # print
if verbose:
LOGGER.info(f"\n{'':>3}{'from':>20}{'n':>3}{'params':>10} {'module':<45}{'arguments':<30}")
ch = [ch]
layers, save, c2 = [], [], ch[-1] # layers, savelist, ch out
for i, (f, n, m, args) in enumerate(d["backbone"] + d["head"]): # from, number, module, args
m = getattr(torch.nn, m[3:]) if "nn." in m else globals()[m] # get module
for j, a in enumerate(args):
if isinstance(a, str):
with contextlib.suppress(ValueError):
args[j] = locals()[a] if a in locals() else ast.literal_eval(a)
n = n_ = max(round(n * depth), 1) if n > 1 else n # depth gain
if m in {
Classify,
Conv,
ConvTranspose,
GhostConv,
Bottleneck,
GhostBottleneck,
SPP,
SPPF,
DWConv,
Focus,
BottleneckCSP,
C1,
C2,
C2f,
RepNCSPELAN4,
ELAN1,
ADown,
AConv,
SPPELAN,
C2fAttn,
C3,
C3TR,
C3Ghost,
nn.ConvTranspose2d,
DWConvTranspose2d,
C3x,
RepC3,
PSA,
SCDown,
C2fCIB,
}:
c1, c2 = ch[f], args[0]
if c2 != nc: # if c2 not equal to number of classes (i.e. for Classify() output)
c2 = make_divisible(min(c2, max_channels) * width, 8)
if m is C2fAttn:
args[1] = make_divisible(min(args[1], max_channels // 2) * width, 8) # embed channels
args[2] = int(
max(round(min(args[2], max_channels // 2 // 32)) * width, 1) if args[2] > 1 else args[2]
) # num heads
args = [c1, c2, *args[1:]]
if m in {BottleneckCSP, C1, C2, C2f, C2fAttn, C3, C3TR, C3Ghost, C3x, RepC3, C2fCIB}:
args.insert(2, n) # number of repeats
n = 1
elif m is AIFI:
args = [ch[f], *args]
elif m in {HGStem, HGBlock}:
c1, cm, c2 = ch[f], args[0], args[1]
args = [c1, cm, c2, *args[2:]]
if m is HGBlock:
args.insert(4, n) # number of repeats
n = 1
elif m is ResNetLayer:
c2 = args[1] if args[3] else args[1] * 4
elif m is nn.BatchNorm2d:
args = [ch[f]]
elif m is Concat:
c2 = sum(ch[x] for x in f)
elif m in {Detect, WorldDetect, Segment, Pose, OBB, ImagePoolingAttn, v10Detect}:
args.append([ch[x] for x in f])
if m is Segment:
args[2] = make_divisible(min(args[2], max_channels) * width, 8)
elif m is RTDETRDecoder: # special case, channels arg must be passed in index 1
args.insert(1, [ch[x] for x in f])
elif m is CBLinear:
c2 = args[0]
c1 = ch[f]
args = [c1, c2, *args[1:]]
elif m is CBFuse:
c2 = ch[f[-1]]
else:
c2 = ch[f]
m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args) # module
t = str(m)[8:-2].replace("__main__.", "") # module type
m.np = sum(x.numel() for x in m_.parameters()) # number params
m_.i, m_.f, m_.type = i, f, t # attach index, 'from' index, type
if verbose:
LOGGER.info(f"{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f} {t:<45}{str(args):<30}") # print
save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1) # append to savelist
layers.append(m_)
if i == 0:
ch = []
ch.append(c2)
return nn.Sequential(*layers), sorted(save)
At this point we know how yolov8 builds the model. Since I only need the backbone here, I print just the backbone, with the following code:
yaml_content = {'nc': 1,
'kpt_shape': [17, 3],
'scales': {'n': [0.33, 0.25, 1024],
's': [0.33, 0.5, 1024],
'm': [0.67, 0.75, 768],
'l': [1.0, 1.0, 512],
'x': [1.0, 1.25, 512]},
'backbone': [[-1, 1, 'Conv', [64, 3, 2]],
[-1, 1, 'Conv', [128, 3, 2]],
[-1, 3, 'C2f', [128, True]],
[-1, 1, 'Conv', [256, 3, 2]],
[-1, 6, 'C2f', [256, True]],
[-1, 1, 'Conv', [512, 3, 2]],
[-1, 6, 'C2f', [512, True]],
[-1, 1, 'Conv', [1024, 3, 2]],
[-1, 3, 'C2f', [1024, True]],
[-1, 1, 'SPPF', [1024, 5]]],
'head': [],
'yaml_file': 'D:\\ultralytics\\ultralytics\\cfg\\models\\v8\\yolov8-pose.yaml'}
model, save = parse_model(yaml_content, ch=3, verbose=True)
print(model)
The printed structure is as follows:
Sequential(
(0): Conv(
(conv): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(1): Conv(
(conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(2): C2f(
(cv1): Conv(
(conv): Conv2d(32, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(48, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
)
)
)
(3): Conv(
(conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(4): C2f(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(m): ModuleList(
(0-1): 2 x Bottleneck(
(cv1): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
)
)
)
(5): Conv(
(conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(6): C2f(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(m): ModuleList(
(0-1): 2 x Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
)
)
)
(7): Conv(
(conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(8): C2f(
(cv1): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(m): ModuleList(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
)
)
)
(9): SPPF(
(cv1): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(cv2): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): SiLU()
)
(m): MaxPool2d(kernel_size=5, stride=1, padding=2, dilation=1, ceil_mode=False)
)
)
At this point, loading the backbone from the yaml file is complete.
That wraps up the yolov8 analysis; next we write it by hand.
2. Hand-writing the yolov8 backbone code
2.1 Mapping out the backbone structure from the model
We printed the model's layered structure above, so now we just hand-write it following the backbone layout. As a reference I borrow the hand-drawn diagram from MMYOLO:
Since this section only concerns the backbone, I cropped out only the backbone-related part. From the figure, we only need to focus on a few key modules, such as C2f and SPPF.
So the backbone consists of the first ten config entries (indices 0-9), from the stem convolution to the SPPF layer; the stages are (a shape summary follows this list):
- Conv + Conv + C2f (stem plus the first stage)
- Conv + C2f (feature pyramid level P3)
- Conv + C2f (feature pyramid level P4)
- Conv + C2f + SPPF (feature pyramid level P5)
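For the n scale with a 640*640 input, the printed structure above gives these stage outputs:
- stem Conv: 320*320*16
- stage 1 (Conv + C2f): 160*160*32
- stage 2 (Conv + C2f): 80*80*64, the P3 feature
- stage 3 (Conv + C2f): 40*40*128, the P4 feature
- stage 4 (Conv + C2f + SPPF): 20*20*256, the P5 feature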
2.2 Backbone component 1: Conv
Conv is the standard convolution block. Note that yolov8 tensors use the same layout as torch.nn.Conv2d: (N, C, H, W), where N is the batch size, C the channel count, H the height, and W the width.
The code is as follows:
def autopad(k, p=None, d=1): # kernel, padding, dilation
"""Pad to 'same' shape outputs."""
if d > 1:
k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k] # actual kernel-size
if p is None:
p = k // 2 if isinstance(k, int) else [x // 2 for x in k] # auto-pad
return p
class Conv(nn.Module):
"""Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
default_act = nn.SiLU() # default activation
def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
"""Initialize Conv layer with given arguments including activation."""
super().__init__()
self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
self.bn = nn.BatchNorm2d(c2)
self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
def forward(self, x):
"""Apply convolution, batch normalization and activation to input tensor."""
return self.act(self.bn(self.conv(x)))
def forward_fuse(self, x):
"""Perform transposed convolution of 2D data."""
return self.act(self.conv(x))
Note that c2 (the number of kernels, i.e. output channels) is scaled by a width multiplier that depends on the model size; see the width_multiple column in the table at the top right of the framework diagram. For example, the first convolution layer is labeled 320 x 320 x 64 x w in the diagram; for the s model w = 0.5, so the final kernel count is 32 (the 2nd entry of arguments in example output 1).
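As a quick sanity check of the Conv block above (a hypothetical snippet):
import torch

x = torch.randn(1, 3, 640, 640)
conv = Conv(c1=3, c2=16, k=3, s=2)  # the n-model stem: autopad(3) gives p=1
print(conv(x).shape)                # torch.Size([1, 16, 320, 320])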
2.3 Backbone component 2: Bottleneck
Cropping the Bottleneck out of the framework diagram, the structure is very simple: two Conv layers, plus a flag (shortcut) controlling whether a residual connection is used. That flag shows up as the fourth argument of C2f in the example output.
The code is as follows:
class Bottleneck(nn.Module):
"""Standard bottleneck."""
def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):
"""Initializes a bottleneck module with given input/output channels, shortcut option, group, kernels, and
expansion.
"""
super().__init__()
c_ = int(c2 * e) # hidden channels
self.cv1 = Conv(c1, c_, k[0], 1)
self.cv2 = Conv(c_, c2, k[1], 1, g=g)
self.add = shortcut and c1 == c2
def forward(self, x):
"""'forward()' applies the YOLO FPN to input data."""
return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
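A quick hypothetical shape check: with shortcut=True and matching channel counts, the residual addition is enabled and shapes are preserved:
b = Bottleneck(c1=64, c2=64, shortcut=True)
x = torch.randn(1, 64, 80, 80)
print(b(x).shape, b.add)  # torch.Size([1, 64, 80, 80]) True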
2.4 Backbone component 3: C2f
The main idea of a CSP layer is to split the base features into two parts along the channel dimension: one part goes through a DenseNet-style path plus a transition layer, is merged with the other part, and then passes through another transition layer.
The yolov8 C2f is implemented as shown below:
C2f differs from standard CSPNet in that there is no extra convolution on the intermediate split path. C2f mainly consists of:
- 1. one 1*1 Conv layer
- 2. n Bottleneck layers, each taking a shortcut argument
- 3. a Conv layer applied after the concatenation
As an example, take yolov8l (w=1, d=1) with a 640*640 image. Referring to the framework diagram and the output below, the C2f at index 4 receives an 80*80*256*w input, with n = 6*d = 6. The layer executes in order:
- 1. one convolution, output 80*80*256
- 2. split the 256 channels into two halves, giving a y list of y0, y1, each 80*80*128
- 3. run y1 through one Bottleneck (channel count unchanged) and append the output y2, so y becomes y0, y1, y2
- 4. run the Bottleneck 5 more times on the latest output (n = 6 in total), growing y to y0, y1, ..., y7, each 80*80*128
- 5. concatenate y along the channel dimension, giving 80*80*1024, where 1024 = 128 * (n + 2)
- 6. run the concatenated tensor through one final Conv, reducing it back to 80*80*256
The code is as follows:
class C2f(nn.Module):
"""Faster Implementation of CSP Bottleneck with 2 convolutions."""
def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
"""Initialize CSP bottleneck layer with two convolutions with arguments ch_in, ch_out, number, shortcut, groups,
expansion.
"""
super().__init__()
self.c = int(c2 * e) # hidden channels
self.cv1 = Conv(c1, 2 * self.c, 1, 1)
self.cv2 = Conv((2 + n) * self.c, c2, 1) # optional act=FReLU(c2)
self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))
def forward(self, x):
"""Forward pass through C2f layer."""
y = list(self.cv1(x).chunk(2, 1))
y.extend(m(y[-1]) for m in self.m)
return self.cv2(torch.cat(y, 1))
def forward_split(self, x):
"""Forward pass using split() instead of chunk()."""
y = list(self.cv1(x).split((self.c, self.c), 1))
y.extend(m(y[-1]) for m in self.m)
return self.cv2(torch.cat(y, 1))
2.5 Backbone component 4: SPPF
SPP (Spatial Pyramid Pooling) runs pooling of several sizes over a feature map and merges the results, producing a fixed-size output. This handles targets of widely varying scale, improving the network's generalization and robustness. The SPP structure is shown below:
The architecture of yolov8's SPPF is shown below:
YOLOv8's SPPF (Spatial Pyramid Pooling - Fast) is not there to produce a fixed-size representation; it pools over several effective sizes cheaply. The process:
- 1. a 1*1 convolution halves the input channels
- 2. three successive nn.MaxPool2d operations with kernel 5 are applied, and the un-pooled input is kept as a fourth branch
- 3. the concatenation is projected down to the output channel count
The code is as follows:
class SPPF(nn.Module):
"""Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher."""
def __init__(self, c1, c2, k=5):
"""
Initializes the SPPF layer with given input/output channels and kernel size.
This module is equivalent to SPP(k=(5, 9, 13)).
"""
super().__init__()
c_ = c1 // 2 # hidden channels
self.cv1 = Conv(c1, c_, 1, 1)
self.cv2 = Conv(c_ * 4, c2, 1, 1)
self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
def forward(self, x):
"""Forward pass through Ghost Convolution block."""
y = [self.cv1(x)]
y.extend(self.m(y[-1]) for _ in range(3))
return self.cv2(torch.cat(y, 1))
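Because each 5*5 pooling acts on the previous pooling's output, the three pooled branches have effective receptive fields of 5, 9, and 13, which is why the docstring calls SPPF equivalent to SPP(k=(5, 9, 13)). A hypothetical shape check:
sppf = SPPF(c1=256, c2=256, k=5)
x = torch.randn(1, 256, 20, 20)
print(sppf(x).shape)  # torch.Size([1, 256, 20, 20]); cv2 sees 4 * 128 = 512 channels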
2.6 Assembling the backbone code
Everything is mapped out above, so now we just write it. The code is as follows:
import torch.nn as nn
import torch
from yolov8_blocks import *
class PoseModel(nn.Module):
"""YOLOv8 pose model."""
def __init__(self, base_channels, base_depth, deep_mul):
super().__init__()
self.stem = Conv(c1=3, c2=base_channels, k=3, s=2, p=1)
self.stage_layer1 = nn.Sequential(
Conv(c1=base_channels, c2=base_channels * 2, k=3, s=2, p=1),
C2f(base_channels * 2, base_channels * 2, base_depth, True),
)
self.stage_layer2 = nn.Sequential(
Conv(c1=base_channels * 2, c2=base_channels * 4, k=3, s=2, p=1),
C2f(base_channels * 4, base_channels * 4, base_depth * 2, True),
)
self.stage_layer3 = nn.Sequential(
Conv(c1=base_channels * 4, c2=base_channels * 8, k=3, s=2, p=1),
C2f(base_channels * 8, base_channels * 8, base_depth * 2, True),
)
self.stage_layer4 = nn.Sequential(
Conv(c1=base_channels * 8, c2=int(base_channels * 16 * deep_mul), k=3, s=2, p=1),
C2f(int(base_channels * 16 * deep_mul), int(base_channels * 16 * deep_mul), base_depth, True),
SPPF(int(base_channels * 16 * deep_mul), int(base_channels * 16 * deep_mul), k=5)
)
def forward(self, x):
x = self.stem(x)
x = self.stage_layer1(x)
x = self.stage_layer2(x)
feat1 = x
x = self.stage_layer3(x)
feat2 = x
x = self.stage_layer4(x)
feat3 = x
return feat1, feat2, feat3
if __name__ == "__main__":
phi = "n"
depth_dict = {'n' : 0.33, 's' : 0.33, 'm' : 0.67, 'l' : 1.00, 'x' : 1.00,}
width_dict = {'n' : 0.25, 's' : 0.50, 'm' : 0.75, 'l' : 1.00, 'x' : 1.25,}
deep_width_dict = {'n' : 1.00, 's' : 1.00, 'm' : 0.75, 'l' : 0.50, 'x' : 0.50,}
dep_mul, wid_mul, deep_mul = depth_dict[phi], width_dict[phi], deep_width_dict[phi]
base_channels = int(wid_mul * 64) # 64
base_depth = max(round(dep_mul * 3), 1) # 3
    #-----------------------------------------------#
    #   The input image is 640 x 640 x 3
    #-----------------------------------------------#
    #---------------------------------------------------#
    #   Build the backbone model.
    #   It returns three effective feature maps, with shapes:
    #   256 * w, 80, 80
    #   512 * w, 40, 40
    #   1024 * w * deep_mul, 20, 20
    #---------------------------------------------------#
model = PoseModel(base_channels, base_depth, deep_mul)
print(model)
As you can see, it is fairly simple. Printing our model and comparing it with the original shows the structures match, but the naming does not: the original registers everything in a single Sequential called model, while ours names each stage separately. To reuse yolov8's weight files, we can either rename our modules to match the pretrained weight file, or remap the keys while loading. Which to choose depends on your situation: if you are free to modify the model definition, renaming is simpler and more direct; if the definition is fixed or generated by a third-party library, remapping the weights may be the better option. Either way, the key is that the model structure and the key names in the weight file match exactly, so the pretrained weights load successfully.
I chose to modify the model definition. You can inspect a model's parameters and their names by iterating over named_parameters(), which returns an iterator of (name, tensor) tuples.
I first printed the parameter names of my modified model definition, using the code below:
model = PoseModel(base_channels, base_depth, deep_mul)
for name, param in model.named_parameters():
print(name)
Then I loaded the pretrained weights and printed their parameter names:
# Load the weight file
ckpt_path = r"D:\ultralytics\weights\yolov8n-pose.pt"
weights = torch.load(ckpt_path, map_location=torch.device('cpu'))
weights_model = weights["model"] if isinstance(weights, dict) else weights
weights_model = torch.nn.Module.state_dict(weights_model)
# Print the parameter names in the weight file
for name, param in weights_model.items():
# print(f"Weight File Parameter '{name}': {param}")
print(name)
After comparing the two, I aligned my model's layers with the yolov8 backbone. The revised backbone code:
class PoseModel(nn.Module):
"""YOLOv8 pose model."""
def __init__(self, base_channels, base_depth, deep_mul):
super().__init__()
self.model = nn.Sequential(
Conv(c1=3, c2=base_channels, k=3, s=2, p=1),
Conv(c1=base_channels, c2=base_channels * 2, k=3, s=2, p=1),
C2f(base_channels * 2, base_channels * 2, base_depth, True),
Conv(c1=base_channels * 2, c2=base_channels * 4, k=3, s=2, p=1),
C2f(base_channels * 4, base_channels * 4, base_depth * 2, True),
Conv(c1=base_channels * 4, c2=base_channels * 8, k=3, s=2, p=1),
C2f(base_channels * 8, base_channels * 8, base_depth * 2, True),
Conv(c1=base_channels * 8, c2=int(base_channels * 16 * deep_mul), k=3, s=2, p=1),
C2f(int(base_channels * 16 * deep_mul), int(base_channels * 16 * deep_mul), base_depth, True),
SPPF(int(base_channels * 16 * deep_mul), int(base_channels * 16 * deep_mul), k=5)
)
    def forward(self, x):
        # Single pass through the Sequential, grabbing the P3/P4/P5 stage outputs
        # (indices 4, 6 and 9) along the way instead of re-running the early layers.
        feats = []
        for i, layer in enumerate(self.model):
            x = layer(x)
            if i in (4, 6, 9):
                feats.append(x)
        feat1, feat2, feat3 = feats
        return feat1, feat2, feat3
2.7 Extracting the backbone from yolov8's pretrained weights
With the model written, the next step is loading pretrained weights. As you would expect, the published weights cover the backbone, neck, and head, while I only need the backbone, so the backbone portion has to be extracted from the weight file.
Based on the code analysis, I filter the backbone weights by keeping the key-value pairs whose keys are prefixed model.0. through model.9.
The code is as follows:
# Load the full model's weights
weight_path = r"D:\ultralytics\weights\yolov8n-pose.pt"
weights = torch.load(weight_path, map_location=torch.device('cpu'))
weights_model = weights["model"] if isinstance(weights, dict) else weights
full_model_state_dict = torch.nn.Module.state_dict(weights_model)
# Filter out the backbone portion of the weights
prefixes = [f'model.{i}.' for i in range(0, 10)]
backbone_state_dict = {k: v for k, v in full_model_state_dict.items() if any(k.startswith(prefix) for prefix in prefixes)}
# Save the filtered weights to a new file
output_path = r"D:\ultralytics\weights\yolov8n-pose-backbone.pt"
torch.save(backbone_state_dict, output_path)
After saving, we load the file back and print the parameter names it contains:
weights_path = r"weights\yolov8n-pose-backbone.pt"
saved_weights = torch.load(weights_path, map_location=torch.device('cpu'))
for name, param in saved_weights.items():
# print(f"Weight File Parameter '{name}': {param}")
print(name)
Part of the result is shown below:
Here is a helper for loading yolov8 weights into a model:
def load(self, file, verbose=True):
"""
Load the weights into the model.
Args:
file (dict | torch.nn.Module): The pre-trained weights to be loaded.
verbose (bool, optional): Whether to log the transfer progress. Defaults to True.
"""
weights = torch.load(file, map_location="cpu")
model = weights["model"] if isinstance(weights, dict) else weights # torchvision models are not dicts
csd = model.float().state_dict() # checkpoint state_dict as FP32
# csd = intersect_dicts(csd, self.state_dict()) # intersect
self.load_state_dict(csd, strict=True) # load
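The commented-out intersect_dicts line is the usual trick for partial loading: keep only the checkpoint entries whose name and shape both match the current model. A minimal sketch of that idea (an outline, not necessarily the exact ultralytics source):
def intersect_dicts(da, db, exclude=()):
    """Keep entries of da that also exist in db with the same shape and no excluded substring."""
    return {
        k: v for k, v in da.items()
        if k in db and all(x not in k for x in exclude) and v.shape == db[k].shape
    }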
2.8 Loading the pretrained weights with our hand-written backbone
With the weights extracted from the pretrained checkpoint ready, we can load them into our own model.
The code is as follows:
model = PoseModel(base_channels, base_depth, deep_mul)
# print(model)
ckpt_path = r"\ultralytics\weights\yolov8n-pose-backbone.pt"
saved_weights = torch.load(ckpt_path, map_location=torch.device('cpu'))
# Try loading the weights into the model
model.load_state_dict(saved_weights)
print("successful")
That completes the full understanding of the yolov8 backbone.
2.9 Trying inference with our hand-written backbone
With the model loaded, we feed in data and inspect the outputs.
phi = "n"
depth_dict = {'n' : 0.33, 's' : 0.33, 'm' : 0.67, 'l' : 1.00, 'x' : 1.00,}
width_dict = {'n' : 0.25, 's' : 0.50, 'm' : 0.75, 'l' : 1.00, 'x' : 1.25,}
deep_width_dict = {'n' : 1.00, 's' : 1.00, 'm' : 0.75, 'l' : 0.50, 'x' : 0.50,}
dep_mul, wid_mul, deep_mul = depth_dict[phi], width_dict[phi], deep_width_dict[phi]
base_channels = int(wid_mul * 64) # 64
base_depth = max(round(dep_mul * 3), 1) # 3
#-----------------------------------------------#
#   The input image is 640 x 640 x 3
#-----------------------------------------------#
#---------------------------------------------------#
#   Build the backbone model.
#   It returns three effective feature maps, with shapes:
#   256 * w, 80, 80
#   512 * w, 40, 40
#   1024 * w * deep_mul, 20, 20
#---------------------------------------------------#
model = PoseModel(base_channels, base_depth, deep_mul)
# print(model)
ckpt_path = r"D:\work\workdata\public\keypoint_code\ultralytics\weights\yolov8n-pose-backbone.pt"
saved_weights = torch.load(ckpt_path, map_location=torch.device('cpu'))
# Try loading the weights into the model
model.load_state_dict(saved_weights)
random_input = torch.randn(8, 3, 640, 640)
res = model.forward(random_input)
print(len(res))
print(res[0].shape, res[1].shape, res[2].shape)
print("successful")
The printed result: len(res) is 3, and for the n model the three feature maps come out as torch.Size([8, 64, 80, 80]), torch.Size([8, 128, 40, 40]), and torch.Size([8, 256, 20, 20]), matching the expected P3/P4/P5 shapes.