Source: "Smart Object Detection 60: Building a YoloV7 Object Detection Platform with PyTorch" (睿智的目标检测60), Bubbliiiing's blog on CSDN
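Improvements
- Backbone: a new multi-branch stacking structure is used for feature extraction, so the skip connections are denser than in earlier YOLO versions. A new downsampling structure extracts and compresses features with max pooling and a stride-2 convolution in parallel.
- Enhanced feature extraction (neck): the same multi-branch stacking structure is used for feature extraction.
- Special SPP structure: an SPP with a CSP structure enlarges the receptive field. The CSP structure gives the SPP module one large residual branch that helps optimization and feature extraction.
- Adaptive multi-positive-sample matching: before YoloV5, each ground-truth box had a single positive sample during training (exactly one anchor box was responsible for predicting it). To make training more efficient, YoloV7 increases the number of positive samples: each ground-truth box is predicted by several anchor boxes. For each ground-truth box, a cost is also computed from the IoU and class scores of the predicted boxes (the anchors after adjustment), and the anchors best suited to that ground-truth box are selected.
- Re-parameterization: borrowing from RepVGG, RepConv is introduced in specific parts of the network. After fusing (fuse), the parameter count drops without reducing the network's performance.
- Auxiliary branches are used to help convergence; they are not used in the smaller v7 and v7x models.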
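Network
- The purpose of feature fusion is to combine feature information from different scales.
- The neck still uses the PANet structure: features are upsampled for one round of fusion and then downsampled for a second round of fusion.
- Yolo Head is YoloV7's classifier and regressor. A feature map can be viewed as a collection of feature points; each feature point has three anchor boxes, and each anchor box carries a vector of channel features. What Yolo Head actually does is judge, for each feature point, whether one of its anchor boxes corresponds to an object.
- As in earlier YOLO versions, the head in YoloV7 is coupled rather than decoupled: classification and regression are produced by the same 1x1 convolution.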
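Backbone
Features:
- It uses multi-branch stacking modules.
  - The final stacking step of the module takes several branches as input.
  - From left to right, the four branches contain one, one, three, and five Conv + BatchNorm + activation units.
  - Purpose of stacking this much: it corresponds to a denser residual structure, which is easy to optimize and can gain accuracy from considerable added depth. The skip connections inside the residual blocks mitigate the vanishing-gradient problem that comes with deeper networks.
- It uses a new transition module, Transition_Block, for downsampling.
Multi-branch stacking module code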
class Multi_Concat_Block(nn.Module):
    def __init__(self, c1, c2, c3, n=4, e=1, ids=[0]):
        super(Multi_Concat_Block, self).__init__()
        c_ = int(c2 * e)

        self.ids = ids
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = nn.ModuleList(
            [Conv(c_ if i == 0 else c2, c2, 3, 1) for i in range(n)]
        )
        self.cv4 = Conv(c_ * 2 + c2 * (len(ids) - 2), c3, 1, 1)

    def forward(self, x):
        x_1 = self.cv1(x)
        x_2 = self.cv2(x)

        x_all = [x_1, x_2]
        for i in range(len(self.cv3)):
            x_2 = self.cv3[i](x_2)
            x_all.append(x_2)

        out = self.cv4(torch.cat([x_all[id] for id in self.ids], 1))
        return out
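The channel bookkeeping in the block above can be checked without PyTorch. The sketch below is only an illustration: `concat_channels` is a hypothetical helper, not part of the source repository. It compares the width of the concatenated branches with the input width the constructor declares for `cv4`, using the `l` variant arguments that appear later in these notes.

```python
def concat_channels(c2, e=1, n=4, ids=(0,)):
    """Return (actual, declared) input width of cv4 in a Multi_Concat_Block."""
    c_ = int(c2 * e)
    # x_all = [cv1 output, cv2 output, then the n successive 3x3 conv outputs]
    branch_channels = [c_, c_] + [c2] * n
    # Width of torch.cat([x_all[i] for i in ids], 1)
    actual = sum(branch_channels[i] for i in ids)
    # Width the constructor passes to cv4 as its input size
    declared = c_ * 2 + c2 * (len(ids) - 2)
    return actual, declared


# Backbone dark2, "l" variant: c2 = block_channels * 2 = 64, e = 1
backbone = concat_channels(64, e=1, n=4, ids=[-1, -3, -5, -6])
# Neck block, "l" variant: c2 = panet_channels * 4 = 128, e = 2
neck = concat_channels(128, e=2, n=4, ids=[-1, -2, -3, -4, -5, -6])
print(backbone, neck)
```

The two numbers agree only when `ids` selects both the `cv1` and the `cv2` outputs, which is the case for every `ids` list used in these notes.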
Transition module: Transition_Block
class MP(nn.Module):
    def __init__(self, k=2):
        super(MP, self).__init__()
        self.m = nn.MaxPool2d(kernel_size=k, stride=k)

    def forward(self, x):
        return self.m(x)

class Transition_Block(nn.Module):
    def __init__(self, c1, c2):
        super(Transition_Block, self).__init__()
        self.cv1 = Conv(c1, c2, 1, 1)
        self.cv2 = Conv(c1, c2, 1, 1)
        self.cv3 = Conv(c2, c2, 3, 2)

        self.mp = MP()

    def forward(self, x):
        # Branch 1: 2x2 max pooling, then a 1x1 conv
        x_1 = self.mp(x)
        x_1 = self.cv1(x_1)

        # Branch 2: 1x1 conv, then a 3x3 conv with stride 2
        x_2 = self.cv2(x)
        x_2 = self.cv3(x_2)

        # Both branches are at half resolution; stack them on the channel axis
        return torch.cat([x_2, x_1], 1)
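Why the two branches can be concatenated follows from the usual output-size formula for convolution and pooling. The sketch below is a hypothetical helper in plain Python, assuming floor rounding and dilation 1 (the PyTorch defaults). It computes the output shape of a Transition_Block from its input size.

```python
def conv_out_size(size, k, s, p):
    """Output length along one axis for a conv or pool (floor mode, dilation 1)."""
    return (size + 2 * p - k) // s + 1


def transition_block_shape(h, w, c2):
    """Return (height, width, channels) produced by Transition_Block(c1, c2)."""
    # Branch 1: 2x2 max pool with stride 2 (no padding), then a 1x1 conv to c2 channels
    pool = (conv_out_size(h, 2, 2, 0), conv_out_size(w, 2, 2, 0))
    # Branch 2: 1x1 conv to c2 channels, then a 3x3 stride-2 conv (autopad gives padding 1)
    conv = (conv_out_size(h, 3, 2, 1), conv_out_size(w, 3, 2, 1))
    if pool != conv:
        raise ValueError(f"branch sizes differ: pool={pool}, conv={conv}")
    # torch.cat on the channel axis doubles the channel count
    return pool[0], pool[1], 2 * c2


print(transition_block_shape(160, 160, 128))
```

With an even input size both branches halve the resolution exactly. With an odd size such as 81 the pooling branch gives 40 and the convolution branch gives 41, so under these assumptions the concatenation only lines up for even feature-map sizes.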
Backbone code
import torch
import torch.nn as nn

def autopad(k, p=None):
    # "Same" padding for odd kernel sizes when p is not given
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

class SiLU(nn.Module):
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)

class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=SiLU()):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
        self.act = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        # Used after BatchNorm has been folded into the conv
        return self.act(self.conv(x))
# Multi_Concat_Block, MP and Transition_Block belong here; they are identical to the listings in the two sections above.
class Backbone(nn.Module):
    def __init__(self, transition_channels, block_channels, n, phi, pretrained=False):
        super().__init__()
        #-----------------------------------------------#
        #   Input image is 640, 640, 3
        #-----------------------------------------------#
        ids = {
            'l' : [-1, -3, -5, -6],
            'x' : [-1, -3, -5, -7, -8],
        }[phi]
        self.stem = nn.Sequential(
            Conv(3, transition_channels, 3, 1),
            Conv(transition_channels, transition_channels * 2, 3, 2),
            Conv(transition_channels * 2, transition_channels * 2, 3, 1),
        )
        self.dark2 = nn.Sequential(
            Conv(transition_channels * 2, transition_channels * 4, 3, 2),
            Multi_Concat_Block(transition_channels * 4, block_channels * 2, transition_channels * 8, n=n, ids=ids),
        )
        self.dark3 = nn.Sequential(
            Transition_Block(transition_channels * 8, transition_channels * 4),
            Multi_Concat_Block(transition_channels * 8, block_channels * 4, transition_channels * 16, n=n, ids=ids),
        )
        self.dark4 = nn.Sequential(
            Transition_Block(transition_channels * 16, transition_channels * 8),
            Multi_Concat_Block(transition_channels * 16, block_channels * 8, transition_channels * 32, n=n, ids=ids),
        )
        self.dark5 = nn.Sequential(
            Transition_Block(transition_channels * 32, transition_channels * 16),
            Multi_Concat_Block(transition_channels * 32, block_channels * 8, transition_channels * 32, n=n, ids=ids),
        )

        if pretrained:
            url = {
                "l" : 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_backbone_weights.pth',
                "x" : 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_x_backbone_weights.pth',
            }[phi]
            checkpoint = torch.hub.load_state_dict_from_url(url=url, map_location="cpu", model_dir="./model_data")
            self.load_state_dict(checkpoint, strict=False)
            print("Load weights from " + url.split('/')[-1])

    def forward(self, x):
        x = self.stem(x)
        x = self.dark2(x)
        #-----------------------------------------------#
        #   dark3 output is an effective feature layer:
        #   80, 80, transition_channels * 16 (512 for 'l')
        #-----------------------------------------------#
        x = self.dark3(x)
        feat1 = x
        #-----------------------------------------------#
        #   dark4 output is an effective feature layer:
        #   40, 40, transition_channels * 32 (1024 for 'l')
        #-----------------------------------------------#
        x = self.dark4(x)
        feat2 = x
        #-----------------------------------------------#
        #   dark5 output is an effective feature layer:
        #   20, 20, transition_channels * 32 (1024 for 'l')
        #-----------------------------------------------#
        x = self.dark5(x)
        feat3 = x
        return feat1, feat2, feat3
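The shapes of the three effective feature layers can be read off the constructor arguments above. The sketch below is plain Python with no PyTorch; `backbone_shapes` is a hypothetical helper that assumes an even input size. It traces height, width and channels through the stem and the four stages.

```python
def halve_by_conv(size):
    """3x3 conv with stride 2 and padding 1 (what autopad picks for k=3)."""
    return (size + 2 * 1 - 3) // 2 + 1


def backbone_shapes(size=640, transition_channels=32):
    """Return {stage: (height, width, channels)} for the Backbone listed above."""
    tc = transition_channels
    s = halve_by_conv(size)   # stem: only its second conv has stride 2
    s = halve_by_conv(s)      # dark2: stride-2 conv followed by a Multi_Concat_Block
    shapes = {"dark2": (s, s, tc * 8)}
    # dark3..dark5: a Transition_Block halves the resolution, then a
    # Multi_Concat_Block sets the output width to tc * multiplier.
    for name, multiplier in (("dark3", 16), ("dark4", 32), ("dark5", 32)):
        s = s // 2
        shapes[name] = (s, s, tc * multiplier)
    return shapes


for stage, shape in backbone_shapes(640, 32).items():
    print(stage, shape)
```

For the `l` variant (`transition_channels = 32`) this gives 512, 1024 and 1024 channels for dark3, dark4 and dark5; for the `x` variant (`transition_channels = 40`) the same stages come out at 640, 1280 and 1280 channels.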
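Neck
- With a (640,640,3) input, the backbone yields three initial effective feature layers. Reading the channel arguments of the backbone code for the l variant (transition_channels = 32), their shapes are:
  - feat1 = (80,80,512)
  - feat2 = (40,40,1024)
  - feat3 = (20,20,1024)
Feature fusion process (channel counts are again those of the l variant in the code):
- feat3 = (20,20,1024)
  - First, SPPCSPC (counted as part of the FPN) extracts features. This structure enlarges YoloV7's receptive field and yields P5 = (20,20,512).
- P5 first passes through one 1x1 convolution that adjusts its channels,
  - is then upsampled (nn.Upsample in the code),
  - and is fused with feat2 = (40,40,1024) after feat2 has passed through one convolution.
  - A Multi_Concat_Block then extracts features and yields P4 = (40,40,256).
- P4 first passes through one 1x1 convolution that adjusts its channels,
  - is then upsampled,
  - and is fused with feat1 = (80,80,512) after feat1 has passed through one convolution.
  - A Multi_Concat_Block then extracts features and yields P3_out = (80,80,128).
- P3_out = (80,80,128) passes through one Transition_Block for downsampling,
  - is stacked with P4 after downsampling,
  - and a Multi_Concat_Block then extracts features to give P4_out = (40,40,256).
- P4_out = (40,40,256) passes through one Transition_Block for downsampling,
  - is stacked with P5 after downsampling,
  - and a Multi_Concat_Block then extracts features to give P5_out = (20,20,512).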
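The channel widths along this path can be traced from the constructor arguments of YoloBody in the neck code below. The following sketch is plain Python; `neck_channels` is a hypothetical helper, and `tc` stands for `transition_channels`. It also checks that every concatenation has the width the following Multi_Concat_Block expects.

```python
def neck_channels(tc=32):
    """Channel widths along the PANet path, read off YoloBody's constructor."""
    feat1, feat2, feat3 = tc * 16, tc * 32, tc * 32   # backbone outputs
    p5 = tc * 16                  # SPPCSPC(tc*32, tc*16) applied to feat3

    # Top-down path: 1x1 conv, upsample, concat with a 1x1-convolved backbone layer
    cat_p4 = tc * 8 + tc * 8      # conv_for_feat2(feat2) and upsampled conv_for_P5(p5)
    p4 = tc * 8                   # output of conv3_for_upsample1
    cat_p3 = tc * 4 + tc * 4      # conv_for_feat1(feat1) and upsampled conv_for_P4(p4)
    p3_out = tc * 4               # output of conv3_for_upsample2

    # Bottom-up path: Transition_Block(c, c) returns 2*c channels at half resolution
    cat_p4_out = 2 * p3_out + p4
    p4_out = tc * 8               # output of conv3_for_downsample1
    cat_p5_out = 2 * p4_out + p5
    p5_out = tc * 16              # output of conv3_for_downsample2

    return {
        "backbone": (feat1, feat2, feat3),
        "concat_inputs": (cat_p4, cat_p3, cat_p4_out, cat_p5_out),
        "outputs": (p3_out, p4_out, p5_out),
    }


result = neck_channels(32)
print(result["outputs"])
```

For the `l` variant the concatenated widths are 512, 256, 512 and 1024, which match the first arguments of the four Multi_Concat_Block instances (`tc*16`, `tc*8`, `tc*16`, `tc*32`).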
Neck code
# Note: SPPCSPC and fuse_conv_and_bn are used below but are not defined in these notes;
# RepConv is listed in the head code further down.
#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, phi, pretrained=False):
        super(YoloBody, self).__init__()
        #-----------------------------------------------#
        #   Parameters for the different YoloV7 versions
        #-----------------------------------------------#
        transition_channels = {'l' : 32, 'x' : 40}[phi]
        block_channels      = 32
        panet_channels      = {'l' : 32, 'x' : 64}[phi]
        e       = {'l' : 2, 'x' : 1}[phi]
        n       = {'l' : 4, 'x' : 6}[phi]
        ids     = {'l' : [-1, -2, -3, -4, -5, -6], 'x' : [-1, -3, -5, -7, -8]}[phi]
        conv    = {'l' : RepConv, 'x' : Conv}[phi]
        #-----------------------------------------------#
        #   Input image is 640, 640, 3
        #-----------------------------------------------#

        #---------------------------------------------------#
        #   Build the backbone.
        #   It returns three effective feature layers with shapes:
        #   80, 80, 512
        #   40, 40, 1024
        #   20, 20, 1024
        #---------------------------------------------------#
        self.backbone   = Backbone(transition_channels, block_channels, n, phi, pretrained=pretrained)

        self.upsample   = nn.Upsample(scale_factor=2, mode="nearest")

        self.sppcspc                = SPPCSPC(transition_channels * 32, transition_channels * 16)
        self.conv_for_P5            = Conv(transition_channels * 16, transition_channels * 8)
        self.conv_for_feat2         = Conv(transition_channels * 32, transition_channels * 8)
        self.conv3_for_upsample1    = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        self.conv_for_P4            = Conv(transition_channels * 8, transition_channels * 4)
        self.conv_for_feat1         = Conv(transition_channels * 16, transition_channels * 4)
        self.conv3_for_upsample2    = Multi_Concat_Block(transition_channels * 8, panet_channels * 2, transition_channels * 4, e=e, n=n, ids=ids)

        self.down_sample1           = Transition_Block(transition_channels * 4, transition_channels * 4)
        self.conv3_for_downsample1  = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        self.down_sample2           = Transition_Block(transition_channels * 8, transition_channels * 8)
        self.conv3_for_downsample2  = Multi_Concat_Block(transition_channels * 32, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids)

        self.rep_conv_1 = conv(transition_channels * 4, transition_channels * 8, 3, 1)
        self.rep_conv_2 = conv(transition_channels * 8, transition_channels * 16, 3, 1)
        self.rep_conv_3 = conv(transition_channels * 16, transition_channels * 32, 3, 1)

        self.yolo_head_P3 = nn.Conv2d(transition_channels * 8, len(anchors_mask[2]) * (5 + num_classes), 1)
        self.yolo_head_P4 = nn.Conv2d(transition_channels * 16, len(anchors_mask[1]) * (5 + num_classes), 1)
        self.yolo_head_P5 = nn.Conv2d(transition_channels * 32, len(anchors_mask[0]) * (5 + num_classes), 1)

    def fuse(self):
        print('Fusing layers... ')
        for m in self.modules():
            if isinstance(m, RepConv):
                m.fuse_repvgg_block()
            elif type(m) is Conv and hasattr(m, 'bn'):
                m.conv = fuse_conv_and_bn(m.conv, m.bn)
                delattr(m, 'bn')
                m.forward = m.fuseforward
        return self

    def forward(self, x):
        #  backbone
        feat1, feat2, feat3 = self.backbone.forward(x)

        # Top-down path
        P5          = self.sppcspc(feat3)
        P5_conv     = self.conv_for_P5(P5)
        P5_upsample = self.upsample(P5_conv)
        P4          = torch.cat([self.conv_for_feat2(feat2), P5_upsample], 1)
        P4          = self.conv3_for_upsample1(P4)

        P4_conv     = self.conv_for_P4(P4)
        P4_upsample = self.upsample(P4_conv)
        P3          = torch.cat([self.conv_for_feat1(feat1), P4_upsample], 1)
        P3          = self.conv3_for_upsample2(P3)

        # Bottom-up path
        P3_downsample = self.down_sample1(P3)
        P4 = torch.cat([P3_downsample, P4], 1)
        P4 = self.conv3_for_downsample1(P4)

        P4_downsample = self.down_sample2(P4)
        P5 = torch.cat([P4_downsample, P5], 1)
        P5 = self.conv3_for_downsample2(P5)

        P3 = self.rep_conv_1(P3)
        P4 = self.rep_conv_2(P4)
        P5 = self.rep_conv_3(P5)
        #---------------------------------------------------#
        #   Third feature layer
        #   y3 = (batch_size, 75, 80, 80), with 75 = 3 * (5 + 20) for 20 classes
        #---------------------------------------------------#
        out2 = self.yolo_head_P3(P3)
        #---------------------------------------------------#
        #   Second feature layer
        #   y2 = (batch_size, 75, 40, 40)
        #---------------------------------------------------#
        out1 = self.yolo_head_P4(P4)
        #---------------------------------------------------#
        #   First feature layer
        #   y1 = (batch_size, 75, 20, 20)
        #---------------------------------------------------#
        out0 = self.yolo_head_P5(P5)

        return [out0, out1, out2]
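Head
- The neck yields three enhanced feature layers. By the channel arguments of YoloBody (l variant) they leave the PANet path as (20,20,512), (40,40,256) and (80,80,128); after the RepConv layers described next, the Yolo Head receives
  - (20,20,1024)
  - (40,40,512)
  - (80,80,256)
- Unlike earlier YOLO versions, YoloV7 places a RepConv structure (taken from RepVGG) in front of the Yolo Head.
  - Basic idea: introduce a special residual structure during training to assist training.
  - This residual structure is specially designed so that at inference time it can be collapsed into an equivalent plain 3x3 convolution. Network complexity goes down while prediction performance does not.
- For each feature layer, one convolution adjusts the number of channels. The final channel count depends on the number of classes to distinguish. In YoloV7, as in YoloV5, every feature point on every feature layer has 3 anchor boxes.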
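The dependence of the final channel count on the number of classes is simple arithmetic, matching the `len(anchors_mask[i]) * (5 + num_classes)` expression in YoloBody. A minimal sketch, with `head_channels` as a hypothetical helper name:

```python
def head_channels(num_anchors, num_classes):
    """Output channels of one Yolo Head 1x1 convolution."""
    # Per anchor: 4 box regression values + 1 objectness score + one score per class
    return num_anchors * (5 + num_classes)


print(head_channels(3, 20))   # 20 classes
print(head_channels(3, 80))   # 80 classes
```

This is why the code comments in these notes show 75 channels (a 20-class setup) while the decoding section talks about 255 channels (an 80-class setup).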
Head code
import numpy as np
import torch
import torch.nn as nn

from nets.backbone import Backbone, Multi_Concat_Block, Conv, SiLU, Transition_Block, autopad

class RepConv(nn.Module):
    # Represented convolution
    # https://arxiv.org/abs/2101.03697
    def __init__(self, c1, c2, k=3, s=1, p=None, g=1, act=SiLU(), deploy=False):
        super(RepConv, self).__init__()
        self.deploy         = deploy
        self.groups         = g
        self.in_channels    = c1
        self.out_channels   = c2

        assert k == 3
        assert autopad(k, p) == 1

        padding_11  = autopad(k, p) - k // 2
        self.act    = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

        if deploy:
            # Inference form: a single 3x3 conv with bias
            self.rbr_reparam = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=True)
        else:
            # Training form: identity (BN only), 3x3 conv + BN, and 1x1 conv + BN branches
            self.rbr_identity = (nn.BatchNorm2d(num_features=c1, eps=0.001, momentum=0.03) if c2 == c1 and s == 1 else None)
            self.rbr_dense = nn.Sequential(
                nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03),
            )
            self.rbr_1x1 = nn.Sequential(
                nn.Conv2d(c1, c2, 1, s, padding_11, groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03),
            )

    def forward(self, inputs):
        if hasattr(self, "rbr_reparam"):
            return self.act(self.rbr_reparam(inputs))
        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)
        return self.act(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out)

    def get_equivalent_kernel_bias(self):
        kernel3x3, bias3x3  = self._fuse_bn_tensor(self.rbr_dense)
        kernel1x1, bias1x1  = self._fuse_bn_tensor(self.rbr_1x1)
        kernelid, biasid    = self._fuse_bn_tensor(self.rbr_identity)
        return (
            kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid,
            bias3x3 + bias1x1 + biasid,
        )

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            return nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel      = branch[0].weight
            running_mean = branch[1].running_mean
            running_var = branch[1].running_var
            gamma       = branch[1].weight
            beta        = branch[1].bias
            eps         = branch[1].eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, "id_tensor"):
                # Identity expressed as a 3x3 kernel with a single 1 at the center
                input_dim = self.in_channels // self.groups
                kernel_value = np.zeros(
                    (self.in_channels, input_dim, 3, 3), dtype=np.float32
                )
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(branch.weight.device)
            kernel      = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma       = branch.weight
            beta        = branch.bias
            eps         = branch.eps
        std = (running_var + eps).sqrt()
        t   = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def repvgg_convert(self):
        kernel, bias = self.get_equivalent_kernel_bias()
        return (
            kernel.detach().cpu().numpy(),
            bias.detach().cpu().numpy(),
        )

    def fuse_conv_bn(self, conv, bn):
        # Fold BatchNorm statistics into the conv weight and bias
        std     = (bn.running_var + bn.eps).sqrt()
        bias    = bn.bias - bn.running_mean * bn.weight / std

        t       = (bn.weight / std).reshape(-1, 1, 1, 1)
        weights = conv.weight * t

        bn      = nn.Identity()
        conv    = nn.Conv2d(in_channels=conv.in_channels,
                            out_channels=conv.out_channels,
                            kernel_size=conv.kernel_size,
                            stride=conv.stride,
                            padding=conv.padding,
                            dilation=conv.dilation,
                            groups=conv.groups,
                            bias=True,
                            padding_mode=conv.padding_mode)

        conv.weight = torch.nn.Parameter(weights)
        conv.bias   = torch.nn.Parameter(bias)
        return conv

    def fuse_repvgg_block(self):
        if self.deploy:
            return
        print(f"RepConv.fuse_repvgg_block")
        self.rbr_dense  = self.fuse_conv_bn(self.rbr_dense[0], self.rbr_dense[1])

        self.rbr_1x1    = self.fuse_conv_bn(self.rbr_1x1[0], self.rbr_1x1[1])
        rbr_1x1_bias    = self.rbr_1x1.bias
        weight_1x1_expanded = torch.nn.functional.pad(self.rbr_1x1.weight, [1, 1, 1, 1])

        # Fuse self.rbr_identity
        if (isinstance(self.rbr_identity, nn.BatchNorm2d) or isinstance(self.rbr_identity, nn.modules.batchnorm.SyncBatchNorm)):
            identity_conv_1x1 = nn.Conv2d(
                in_channels=self.in_channels,
                out_channels=self.out_channels,
                kernel_size=1,
                stride=1,
                padding=0,
                groups=self.groups,
                bias=False)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.to(self.rbr_1x1.weight.data.device)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.squeeze().squeeze()
            identity_conv_1x1.weight.data.fill_(0.0)
            identity_conv_1x1.weight.data.fill_diagonal_(1.0)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.unsqueeze(2).unsqueeze(3)

            identity_conv_1x1           = self.fuse_conv_bn(identity_conv_1x1, self.rbr_identity)
            bias_identity_expanded      = identity_conv_1x1.bias
            weight_identity_expanded    = torch.nn.functional.pad(identity_conv_1x1.weight, [1, 1, 1, 1])
        else:
            bias_identity_expanded      = torch.nn.Parameter(torch.zeros_like(rbr_1x1_bias))
            weight_identity_expanded    = torch.nn.Parameter(torch.zeros_like(weight_1x1_expanded))

        # Sum the three branches into one 3x3 conv
        self.rbr_dense.weight   = torch.nn.Parameter(self.rbr_dense.weight + weight_1x1_expanded + weight_identity_expanded)
        self.rbr_dense.bias     = torch.nn.Parameter(self.rbr_dense.bias + rbr_1x1_bias + bias_identity_expanded)

        self.rbr_reparam    = self.rbr_dense
        self.deploy         = True

        if self.rbr_identity is not None:
            del self.rbr_identity
            self.rbr_identity = None

        if self.rbr_1x1 is not None:
            del self.rbr_1x1
            self.rbr_1x1 = None

        if self.rbr_dense is not None:
            del self.rbr_dense
            self.rbr_dense = None
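The core of the fusion is that convolution is linear, so the 3x3 branch, the 1x1 branch and the identity branch can be summed into one 3x3 kernel. The sketch below shows this for a single channel in plain Python. It is an illustration under simplifying assumptions, not the repository's method: BatchNorm is taken as already folded into each branch, biases are left out, and all function names are hypothetical.

```python
def conv3x3_same(img, kernel):
    """Single-channel 3x3 cross-correlation with zero padding 1."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * img[yy][xx]
            out[y][x] = acc
    return out


def pad_1x1_to_3x3(weight):
    """Place a 1x1 weight at the center of an otherwise zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, weight, 0.0], [0.0, 0.0, 0.0]]


def add_kernels(*kernels):
    return [[sum(k[r][c] for k in kernels) for c in range(3)] for r in range(3)]


img = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.5, -1.0, 0.0], [2.0, 1.0, -0.5], [0.0, 0.25, 1.0]]   # 3x3 branch
w1 = 0.75                                                     # 1x1 branch

# Training-time form: three branches evaluated separately and summed
dense = conv3x3_same(img, k3)
branches = [[dense[y][x] + w1 * img[y][x] + img[y][x] for x in range(3)]
            for y in range(3)]

# Inference-time form: one 3x3 conv with the merged kernel
merged_kernel = add_kernels(k3, pad_1x1_to_3x3(w1), pad_1x1_to_3x3(1.0))
merged = conv3x3_same(img, merged_kernel)

max_diff = max(abs(branches[y][x] - merged[y][x])
               for y in range(3) for x in range(3))
print(max_diff)
```

In the RepConv class above the same padding trick appears as `nn.functional.pad(kernel1x1, [1, 1, 1, 1])`, and the identity branch is written as a kernel with a single 1 at position (1, 1).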
# YoloBody follows here in the source file; it is identical to the listing under the neck code above.
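Decoding
- After the head we have the predictions of three feature layers (at this stage they are adjustment parameters, not boxes). Shapes:
  - (N,20,20,255)
  - (N,40,40,255)
  - (N,80,80,255)
- These predictions do not yet give the position of the final boxes on the image; they have to be decoded first. In YoloV7, as in YoloV5, every feature point on every feature layer has 3 anchor boxes.
- The final 255 of each feature layer splits into 3 x 85, that is, 85 parameters for each of the 3 anchor boxes. After reshaping:
  - (N,20,20,3,85)
  - (N,40,40,3,85)
  - (N,80,80,3,85)
- 85 splits into 4+1+80:
  - The first 4 values are the regression parameters of each feature point. They are what adjusts the anchor box; the adjusted anchor is the predicted box.
  - The 5th value says whether the feature point contains an object.
  - The last 80 values give the class of the object at the feature point.
- Decoding steps:
  - Compute the predicted center: the first two regression values offset the center coordinates of the three anchor boxes (drawn as three red points in the source post's figure).
  - Compute the predicted width and height: the last two regression values are passed through a sigmoid, doubled and squared, then multiplied by the anchor width and height, as decode_box below does (earlier YOLO versions used an exponential here).
  - The resulting predicted box can now be drawn on the image.
Taking the (N,20,20,3,85) layer as an example: it divides the image into 20x20 feature points, and a feature point that falls inside an object's box is used to predict that object.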
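The decoding steps can be sketched for a single anchor at a single feature point in plain Python. The formulas follow the `decode_box` code in the next section (sigmoid, then `2 * s - 0.5` for the center and `(2 * s) ** 2` for the size); `decode_one` and its argument names are hypothetical, and the anchor size is assumed to be given in feature-map units (pixels divided by the stride).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_one(raw, grid_x, grid_y, anchor_w, anchor_h, feat_w, feat_h):
    """Decode raw (tx, ty, tw, th) into a normalized (cx, cy, w, h) box."""
    tx, ty, tw, th = raw
    # Center: offset in (-0.5, 1.5) relative to the cell's top-left corner
    cx = sigmoid(tx) * 2.0 - 0.5 + grid_x
    cy = sigmoid(ty) * 2.0 - 0.5 + grid_y
    # Size: anchor scaled by a factor in (0, 4)
    bw = (sigmoid(tw) * 2.0) ** 2 * anchor_w
    bh = (sigmoid(th) * 2.0) ** 2 * anchor_h
    # Normalize by the feature-map size so all values lie in the 0..1 range
    return cx / feat_w, cy / feat_h, bw / feat_w, bh / feat_h


# Anchor [142, 110] on the 20x20 layer of a 640x640 input (stride 32)
box = decode_one((0.0, 0.0, 0.0, 0.0), grid_x=7, grid_y=9,
                 anchor_w=142 / 32, anchor_h=110 / 32, feat_w=20, feat_h=20)
print(box)
```

With all raw values at zero the sigmoid is 0.5, so the center lands in the middle of cell (7, 9) and the box keeps exactly the anchor's size.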
Decoding code
def decode_box(self, inputs):
    outputs = []
    for i, input in enumerate(inputs):
        #-----------------------------------------------#
        #   There are three inputs, with shapes
        #   batch_size, 255, 20, 20
        #   batch_size, 255, 40, 40
        #   batch_size, 255, 80, 80
        #-----------------------------------------------#
        batch_size      = input.size(0)
        input_height    = input.size(2)
        input_width     = input.size(3)

        #-----------------------------------------------#
        #   For a 640x640 input
        #   stride_h = stride_w = 32, 16, 8
        #-----------------------------------------------#
        stride_h = self.input_shape[0] / input_height
        stride_w = self.input_shape[1] / input_width
        #-------------------------------------------------#
        #   scaled_anchors are expressed relative to the feature layer
        #-------------------------------------------------#
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors[self.anchors_mask[i]]]

        #-----------------------------------------------#
        #   After the reshape the three inputs have shapes
        #   batch_size, 3, 20, 20, 85
        #   batch_size, 3, 40, 40, 85
        #   batch_size, 3, 80, 80, 85
        #-----------------------------------------------#
        prediction = input.view(batch_size, len(self.anchors_mask[i]),
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        #-----------------------------------------------#
        #   Adjustment parameters for the anchor center
        #-----------------------------------------------#
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        #-----------------------------------------------#
        #   Adjustment parameters for the anchor width and height
        #-----------------------------------------------#
        w = torch.sigmoid(prediction[..., 2])
        h = torch.sigmoid(prediction[..., 3])
        #-----------------------------------------------#
        #   Objectness confidence
        #-----------------------------------------------#
        conf        = torch.sigmoid(prediction[..., 4])
        #-----------------------------------------------#
        #   Class confidence
        #-----------------------------------------------#
        pred_cls    = torch.sigmoid(prediction[..., 5:])

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor  = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        #----------------------------------------------------------#
        #   Build the grid: each anchor starts at the top-left corner of its cell
        #   batch_size, 3, 20, 20
        #----------------------------------------------------------#
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
            batch_size * len(self.anchors_mask[i]), 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
            batch_size * len(self.anchors_mask[i]), 1, 1).view(y.shape).type(FloatTensor)

        #----------------------------------------------------------#
        #   Lay out the anchor widths and heights on the grid
        #   batch_size, 3, 20, 20
        #----------------------------------------------------------#
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        #----------------------------------------------------------#
        #   Adjust the anchors with the predictions.
        #   First shift the anchor center (offset between -0.5 and 1.5
        #   relative to the cell's top-left corner),
        #   then adjust the anchor width and height.
        #----------------------------------------------------------#
        pred_boxes          = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0]  = x.data * 2. - 0.5 + grid_x
        pred_boxes[..., 1]  = y.data * 2. - 0.5 + grid_y
        pred_boxes[..., 2]  = (w.data * 2) ** 2 * anchor_w
        pred_boxes[..., 3]  = (h.data * 2) ** 2 * anchor_h

        #----------------------------------------------------------#
        #   Normalize the outputs to fractions of the feature-map size
        #----------------------------------------------------------#
        _scale = torch.Tensor([input_width, input_height, input_width, input_height]).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) / _scale,
                            conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
        outputs.append(output.data)
    return outputs
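Score filtering and non-maximum suppression
- Score filtering keeps the predicted boxes whose score reaches the confidence threshold.
- Non-maximum suppression (NMS) keeps, within a region, the highest-scoring box of each class.
- Process:
  1. Find the boxes in the image whose score exceeds the threshold. Filtering by score before filtering overlapping boxes greatly reduces the number of boxes.
  2. Loop over the classes. NMS keeps the highest-scoring box of a class within a region, so looping over the classes lets us run NMS on each class separately.
  3. Sort the boxes of that class by score, from high to low.
  4. Repeatedly take the box with the highest score, compute its overlap with all remaining predicted boxes, and discard those that overlap too much.
The results after score filtering and NMS can be used to draw the predicted boxes.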
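The four steps can be written out in plain Python. This is a minimal sketch for illustration, not the repository's implementation (which relies on a library NMS routine); `iou` and `filter_and_nms` are hypothetical names, boxes are `(x1, y1, x2, y2, score, cls)` tuples, and the default thresholds mirror the `conf_thres=0.5` and `nms_thres=0.4` defaults of the code in these notes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(dets, conf_thres=0.5, nms_thres=0.4):
    kept = []
    # Step 1: score filtering
    candidates = [d for d in dets if d[4] >= conf_thres]
    # Step 2: handle each class separately
    for cls in sorted({d[5] for d in candidates}):
        # Step 3: sort this class by score, highest first
        pool = sorted((d for d in candidates if d[5] == cls),
                      key=lambda d: d[4], reverse=True)
        # Step 4: keep the best box, drop boxes that overlap it too much
        while pool:
            best = pool.pop(0)
            kept.append(best)
            pool = [d for d in pool if iou(best, d) <= nms_thres]
    return kept


dets = [
    (10, 10, 50, 50, 0.9, 0),
    (12, 12, 52, 52, 0.8, 0),      # overlaps the first box heavily
    (100, 100, 150, 150, 0.7, 0),
    (10, 10, 50, 50, 0.6, 1),      # same place, different class
    (0, 0, 5, 5, 0.3, 0),          # below the score threshold
]
kept = filter_and_nms(dets)
print(kept)
```

Because suppression runs per class, the class-1 box survives even though it coincides with the best class-0 box.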
Copyright notice: this is an original article by CSDN blogger "Bubbliiiing", licensed under CC 4.0 BY-SA. Reposts must include the link to the original source and this notice.
Original link: https://blog.csdn.net/weixin_44791964/article/details/125827160
Score filtering and non-maximum suppression code
def non_max_suppression(self, prediction, num_classes, input_shape, image_shape, letterbox_image, conf_thres=0.5, nms_thres=0.4):
    #----------------------------------------------------------#
    #   Convert the predictions to top-left / bottom-right corner format.
    #   prediction  [batch_size, num_anchors, 85]
    #----------------------------------------------------------#
    box_corner          = prediction.new(prediction.shape)
    box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
    box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
    box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
    box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
    prediction[:, :, :4] = box_corner[:, :, :4]

    output = [None for _ in range(len(prediction))]
    for i, image_pred in enumerate(prediction):
        #----------------------------------------------------------#
        #   Take the max over the class predictions.
        #   class_conf  [num_anchors, 1]    class confidence
        #   class_pred  [num_anchors, 1]    class index
        #----------------------------------------------------------#
        class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)

        #----------------------------------------------------------#
        #   First round of filtering by confidence
        #----------------------------------------------------------#
        conf_mask = (image_pred[:, 4] * class_conf[:, 0] >= conf_thres).squeeze()

        #----------------------------------------------------------#
        #   Keep only the predictions that pass the confidence filter
        #----------------------------------------------------------#
        image_pred = image_pred[conf_mask]
        class_conf = class_conf[conf_mask]
        class_pred = class_pred[conf_mask]
        if not image_pred.size(0):
            continue
        #-------------------------------------------------------------------------#
        #   detections  [num_anchors, 7]
        #   The 7 values are: x1, y1, x2, y2, obj_conf, class_conf, class_pred
        #-------------------------------------------------------------------------#
        detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)

        #------------------------------------------#
        #   All classes present in the predictions
        #------------------------------------------#
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()
            detections = detections.cuda()

        for c in unique_labels:
            #------------------------------------------#
            #   All predictions of one class after score filtering
            #------------------------------------------#
            detections_class = detections[detections[:, -1] == c]

            #------------------------------------------#
            #   The library NMS is faster.
            #   nms is not imported in this excerpt; it is assumed
            #   to be torchvision.ops.nms.
            #------------------------------------------#
            keep = nms(
                detections_class[:, :4],
                detections_class[:, 4] * detections_class[:, 5],
                nms_thres
            )
            max_detections = detections_class[keep]

            # # Sort by objectness confidence
            # _, conf_sort_index = torch.sort(detections_class[:, 4]*detections_class[:, 5], descending=True)
            # detections_class = detections_class[conf_sort_index]
            # # Run non-maximum suppression
            # max_detections = []
            # while detections_class.size(0):
            #     # Take the highest-confidence box of this class, then step through the rest and drop any whose overlap exceeds nms_thres
            #     max_detections.append(detections_class[0].unsqueeze(0))
            #     if len(detections_class) == 1:
            #         break
            #     ious = bbox_iou(max_detections[-1], detections_class[1:])
            #     detections_class = detections_class[1:][ious < nms_thres]
            # # Stack
            # max_detections = torch.cat(max_detections).data

            # Add max detections to outputs
            output[i] = max_detections if output[i] is None else torch.cat((output[i], max_detections))

        if output[i] is not None:
            output[i]           = output[i].cpu().numpy()
            box_xy, box_wh      = (output[i][:, 0:2] + output[i][:, 2:4])/2, output[i][:, 2:4] - output[i][:, 0:2]
            output[i][:, :4]    = self.yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
    return output
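Training
1. Parameters needed to compute the loss
- pred: the network's predictions
- target: the ground-truth boxes
- Like the network's predictions, the loss has three parts:
  - Reg: the regression parameters of the feature points
  - Obj: whether a feature point contains an object
  - Cls: the class of the object a feature point contains
2. Positive sample matching
This step finds which anchor boxes are considered to have a corresponding ground-truth box and are therefore responsible for predicting it.
a. Each ground-truth box is coarsely matched to anchor boxes and feature points by its coordinates and its width and height.
- Matching anchor boxes
  YoloV7 defines 9 anchor boxes of different sizes; each output feature layer uses 3 of them.
  - Before the change: positive samples were matched by IoU.
  - After the change: matching uses width and height ratios directly. The ratios between the ground-truth box and the 9 anchor boxes are computed. If the ratio between the ground-truth box and an anchor exceeds the threshold, the two do not match well enough and that anchor is treated as a negative sample (the larger the ratio, the poorer the overlap).
- Example
  Suppose there is a ground-truth box of width and height [200, 200], a square. The 9 default YoloV7 anchors are
  [12,16], [19,36], [40,28] for the 80x80 layer
  [36,75], [76,55], [72,146] for the 40x40 layer
  [142,110], [192,243], [459,401] for the 20x20 layer
  With the threshold set to 4, we compute the width and height ratios between the ground-truth box and the 9 anchors.
  Two cases arise when comparing: the ground-truth box may be larger than the anchor, or the anchor may be larger than the ground-truth box.
  So both are computed, ground truth / anchor and anchor / ground truth, and the maximum is taken. We then check which anchors have a result below the threshold.
  Note that the numbers below were computed with the anchor set [10,13], [16,30], [33,23], [30,61], [62,45], [59,119], [116,90], [156,198], [373,326] (the first row is 200/10 and 200/13), not with the YoloV7 defaults listed above. For that set, the four anchors [59,119], [116,90], [156,198], [373,326] meet the requirement.
  - The first five anchors exceed the threshold and are negative samples.
  - The last four are below the threshold and are positive samples: [59,119], [116,90], [156,198], [373,326]. Anchors of these four sizes can be used to predict this ground-truth box.
  - Of these, [116,90], [156,198], [373,326] belong to the 20x20 feature layer.
  - [59,119] belongs to the 40x40 feature layer.
- The result is a matrix of shape [9, 4]: 9 for the 9 anchors, 4 for the columns ground-truth width / anchor width, ground-truth height / anchor height, anchor width / ground-truth width, anchor height / ground-truth height.
[[20. 15.38461538 0.05 0.065 ]
[12.5 6.66666667 0.08 0.15 ]
[ 6.06060606 8.69565217 0.165 0.115 ]
[ 6.66666667 3.27868852 0.15 0.305 ]
[ 3.22580645 4.44444444 0.31 0.225 ]
[ 3.38983051 1.68067227 0.295 0.595 ]
[ 1.72413793 2.22222222 0.58 0.45 ]
[ 1.28205128 1.01010101 0.78 0.99 ]
[ 0.53619303 0.61349693 1.865 1.63 ]]
Taking the maximum of each anchor's row gives
[20. 12.5 8.69565217 6.66666667 4.44444444 3.38983051
2.22222222 1.28205128 1.865 ]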
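The ratio test is easy to reproduce in plain Python. In the sketch below, `matched_anchors` is a hypothetical helper. `table_anchors` is the anchor set implied by the matrix above (its first row, 20 and 15.38, is 200/10 and 200/13), and `yolov7_anchors` is the default list quoted in the example text.

```python
def matched_anchors(gt_wh, anchors, threshold=4.0):
    """Return the anchors whose worst width/height ratio is below the threshold."""
    gw, gh = gt_wh
    matches = []
    for aw, ah in anchors:
        # Ratios in both directions; the largest one decides
        worst = max(gw / aw, gh / ah, aw / gw, ah / gh)
        if worst < threshold:
            matches.append((aw, ah))
    return matches


table_anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                 (59, 119), (116, 90), (156, 198), (373, 326)]
yolov7_anchors = [(12, 16), (19, 36), (40, 28), (36, 75), (76, 55),
                  (72, 146), (142, 110), (192, 243), (459, 401)]

from_table = matched_anchors((200, 200), table_anchors)
from_defaults = matched_anchors((200, 200), yolov7_anchors)
print(from_table)
print(from_defaults)
```

For the anchor set behind the matrix, four anchors pass the threshold of 4. Running the same test on the nine YoloV7 defaults lets five anchors through, because [76,55] has a worst ratio of 200/55, about 3.64.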
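- Matching feature points
  - Before the change: a ground-truth box was predicted by the top-left feature point of the grid cell that contains its center.
  - After the change: for each selected feature layer, first find the grid cell that contains the center of the ground-truth box; the top-left feature point of that cell is one feature point responsible for the prediction. Then, by rounding, the two nearest neighboring cells are found as well, so three cells are responsible for predicting the ground-truth box.
  - Once the feature points are found, the anchors selected in "Matching anchor boxes" at those feature points are responsible for predicting the ground-truth box.
  - In the source post's figure, the red point marks the center of the ground-truth box; besides the cell it lies in, its 2 nearest neighboring cells are also selected. This is why the XY offset of a predicted box no longer ranges from 0 to 1 but from -0.5 to 1.5, consistent with the x * 2 - 0.5 term in decode_box.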
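The neighbor selection can be sketched for one ground-truth center in plain Python. The conditions mirror the `j, k, l, m` masks in `find_3_positive` below. The mapping from each mask to a neighboring cell (left, up, right, down) is not shown in these notes and is assumed here to follow the usual YoloV5-style convention; `positive_cells` is a hypothetical helper.

```python
def positive_cells(gx, gy, feat_w, feat_h, g=0.5):
    """Cells responsible for a ground-truth center (gx, gy) in grid units."""
    cx, cy = int(gx), int(gy)
    cells = [(cx, cy)]                       # the cell containing the center
    # Center in the left / upper half of its cell: add the left / upper neighbor
    if gx % 1.0 < g and gx > 1.0:
        cells.append((cx - 1, cy))
    if gy % 1.0 < g and gy > 1.0:
        cells.append((cx, cy - 1))
    # Same test measured from the bottom-right corner of the feature map:
    # center in the right / lower half, so add the right / lower neighbor
    ix, iy = feat_w - gx, feat_h - gy
    if ix % 1.0 < g and ix > 1.0:
        cells.append((cx + 1, cy))
    if iy % 1.0 < g and iy > 1.0:
        cells.append((cx, cy + 1))
    return cells


cells = positive_cells(3.3, 5.7, feat_w=20, feat_h=20)
print(cells)
```

A center at (3.3, 5.7) lies in the left and lower half of cell (3, 5), so the left neighbor (2, 5) and the lower neighbor (3, 6) are added. Centers close to the border of the feature map get fewer neighbors because of the `> 1.0` conditions.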
Code for matching anchors and feature points
import torch

def find_3_positive(self, predictions, targets):
    #------------------------------------#
    #   Number of anchors per feature layer
    #   and number of ground-truth boxes
    #------------------------------------#
    num_anchor, num_gt = len(self.anchors_mask[0]), targets.shape[0]
    #------------------------------------#
    #   Empty lists for indices and anchors
    #------------------------------------#
    indices, anchors = [], []
    #------------------------------------#
    #   A vector of 7 ones
    #   indices 0 and 1 stay 1
    #   indices 2:6 hold the feature-layer width and height
    #   index 6 stays 1
    #------------------------------------#
    gain = torch.ones(7, device=targets.device)
    #------------------------------------#
    #   ai      [num_anchor, num_gt]
    #   targets [num_gt, 6] => [num_anchor, num_gt, 7]
    #------------------------------------#
    ai = torch.arange(num_anchor, device=targets.device).float().view(num_anchor, 1).repeat(1, num_gt)
    targets = torch.cat((targets.repeat(num_anchor, 1, 1), ai[:, :, None]), 2)  # append anchor indices

    g = 0.5  # offsets
    off = torch.tensor([
        [0, 0],
        [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
        # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
    ], device=targets.device).float() * g

    for i in range(len(predictions)):
        #----------------------------------------------------#
        #   Divide the anchors by the stride to express them
        #   in feature-layer units.
        #   anchors_i [num_anchor, 2]
        #----------------------------------------------------#
        anchors_i = torch.from_numpy(self.anchors[i] / self.stride[i]).type_as(predictions[i])
        #-------------------------------------------#
        #   Width and height of this feature layer
        #-------------------------------------------#
        gain[2:6] = torch.tensor(predictions[i].shape)[[3, 2, 3, 2]]
        #-------------------------------------------#
        #   Multiplying the ground-truth boxes by gain
        #   maps them onto the feature layer
        #-------------------------------------------#
        t = targets * gain
        if num_gt:
            #-------------------------------------------#
            #   Ratio between ground-truth and anchor width/height,
            #   thresholded to keep, for every anchor,
            #   the ground-truth boxes it matches
            #   r [num_anchor, num_gt, 2]
            #   t [num_anchor, num_gt, 7] => [num_matched_anchor, 7]
            #-------------------------------------------#
            r = t[:, :, 4:6] / anchors_i[:, None]
            j = torch.max(r, 1. / r).max(2)[0] < self.threshold
            t = t[j]  # filter
            #-------------------------------------------#
            #   gxy  x/y centre of the matched ground-truth boxes
            #   gxi  the same centre measured from the
            #        bottom-right corner of the feature layer
            #-------------------------------------------#
            gxy = t[:, 2:4]  # grid xy
            gxi = gain[[2, 3]] - gxy  # inverse
            j, k = ((gxy % 1. < g) & (gxy > 1.)).T
            l, m = ((gxi % 1. < g) & (gxi > 1.)).T
            j = torch.stack((torch.ones_like(j), j, k, l, m))
            #-------------------------------------------#
            #   t is repeated 5 times and filtered with j
            #   j has five rows, one per direction
            #   [0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
            #   marking whether the feature point in that
            #   direction is used
            #-------------------------------------------#
            t = t.repeat((5, 1, 1))[j]
            offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
        else:
            t = targets[0]
            offsets = 0
        #-------------------------------------------#
        #   b    index of the image in the batch
        #   gxy  x/y centre of the ground-truth box
        #   gwh  width/height of the ground-truth box
        #   gij  coordinates of the responsible feature point
        #-------------------------------------------#
        b, c = t[:, :2].long().T  # image, class
        gxy = t[:, 2:4]  # grid xy
        gwh = t[:, 4:6]  # grid wh
        gij = (gxy - offsets).long()
        gi, gj = gij.T  # grid xy indices
        #-------------------------------------------#
        #   gj and gi must stay inside the feature layer
        #   (bounds cast to int so clamp_ works on long tensors)
        #   a is the index of the anchor at that feature point
        #-------------------------------------------#
        a = t[:, 6].long()  # anchor indices
        indices.append((b, a, gj.clamp_(0, int(gain[3]) - 1), gi.clamp_(0, int(gain[2]) - 1)))  # image, anchor, grid indices
        anchors.append(anchors_i[a])  # anchors
    return indices, anchors
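This step is only a coarse filter; SimOTA is then used for the precise selection
b. Use SimOTA to decide adaptively how many anchors each ground-truth box gets
YoloV7 computes a cost matrix that expresses the cost between every ground-truth box and every feature point. The notes describe it as having three parts (marked "???"); the code in this section combines two terms, the classification cost and 3.0 times the IoU cost:
- The overlap between each ground-truth box and the box predicted at the current feature point
- The higher this overlap, the more the feature point has already tried to fit that ground-truth box, so its cost is smaller.
- The class-prediction accuracy of the box predicted at the current feature point for each ground-truth box
- The higher this accuracy, the more the feature point has already tried to fit that ground-truth box, so its cost is smaller
The purpose of the cost matrix is to find adaptively which ground-truth box each feature point should fit: the higher the overlap, the more accurate the classification, and the closer the point lies within a given radius, the more that point should fit the box. No radius term appears in the cost computed by the code in this section.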
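The two cost terms can be tried on single numbers in plain Python. This is a sketch under stated assumptions: it works on probabilities rather than logits, and `bce` and `matching_cost` are hypothetical helpers. It uses the fact that binary cross entropy with logits applied to log(y / (1 - y)) equals plain binary cross entropy applied to y.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross entropy of a probability against a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def matching_cost(iou, cls_probs, obj_prob, gt_class, iou_weight=3.0):
    """Cost of pairing one ground-truth box with one candidate prediction."""
    iou_cost = -math.log(iou + 1e-8)            # small when the overlap is high
    cls_cost = 0.0
    for c, cls_prob in enumerate(cls_probs):
        y = math.sqrt(cls_prob * obj_prob)      # joint class/objectness score
        cls_cost += bce(y, 1.0 if c == gt_class else 0.0)
    return cls_cost + iou_weight * iou_cost


good = matching_cost(iou=0.9, cls_probs=[0.9, 0.1], obj_prob=0.9, gt_class=0)
bad = matching_cost(iou=0.3, cls_probs=[0.4, 0.5], obj_prob=0.5, gt_class=0)
print(good, bad)
```

A prediction that overlaps well and classifies correctly ends up with the lower cost, so it is preferred as a positive sample for that ground-truth box.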
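In SimOTA, each target gets its own number of positive samples (dynamic k)
Take the ant and watermelon example from Megvii's official answer: traditional assignment schemes often give a watermelon and an ant in the same scene the same number of positive samples, so either the ant gets many low-quality positives or the watermelon gets only one or two. Neither assignment is appropriate.
The key to dynamic positive-sample assignment is how to determine k. As described for SimOTA, the 10 lowest-cost feature points of each target are taken first, and the IoUs between their predicted boxes and the ground-truth box are summed to obtain k. The YoloV7 code in this section instead sums up to the 20 largest IoUs per ground-truth box, truncates the sum to an integer and uses at least 1.
- The SimOTA procedure:
- Compute the overlap between each ground-truth box and the boxes predicted at the candidate feature points
- Sum the IoUs of the (up to) twenty predicted boxes with the highest overlap to obtain k for each ground-truth box, meaning that k feature points are assigned to that box.
- Compute the class-prediction accuracy of the candidate predictions for each ground-truth box
- Compute the cost matrix
- Take the k points with the lowest cost as the positive samples of that ground-truth box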
import torch
import torch.nn.functional as F

def build_targets(self, predictions, targets, imgs):
    #-------------------------------------------#
    #   Coarse matching of positive samples
    #-------------------------------------------#
    indices, anch = self.find_3_positive(predictions, targets)

    matching_bs = [[] for _ in predictions]
    matching_as = [[] for _ in predictions]
    matching_gjs = [[] for _ in predictions]
    matching_gis = [[] for _ in predictions]
    matching_targets = [[] for _ in predictions]
    matching_anchs = [[] for _ in predictions]
    #-------------------------------------------#
    #   Three layers in total
    #-------------------------------------------#
    num_layer = len(predictions)
    #-------------------------------------------#
    #   Loop over the batch and run OTA matching;
    #   inside that loop, loop over the layers
    #-------------------------------------------#
    for batch_idx in range(predictions[0].shape[0]):
        #-------------------------------------------#
        #   Find the ground-truth boxes of this image
        #-------------------------------------------#
        b_idx = targets[:, 0]==batch_idx
        this_target = targets[b_idx]
        #-------------------------------------------#
        #   Skip the image if it has no ground-truth box
        #-------------------------------------------#
        if this_target.shape[0] == 0:
            continue
        #-------------------------------------------#
        #   Scale the ground-truth coordinates to pixels
        #-------------------------------------------#
        txywh = this_target[:, 2:6] * imgs[batch_idx].shape[1]
        #-------------------------------------------#
        #   Centre/size => top-left/bottom-right corners
        #-------------------------------------------#
        txyxy = self.xywh2xyxy(txywh)

        pxyxys = []
        p_cls = []
        p_obj = []
        from_which_layer = []
        all_b = []
        all_a = []
        all_gj = []
        all_gi = []
        all_anch = []
        #-------------------------------------------#
        #   Loop over the three layers
        #-------------------------------------------#
        for i, prediction in enumerate(predictions):
            #-------------------------------------------#
            #   b is the image index, a the anchor index
            #   gj is the y axis, gi the x axis
            #-------------------------------------------#
            b, a, gj, gi = indices[i]
            idx = (b == batch_idx)
            b, a, gj, gi = b[idx], a[idx], gj[idx], gi[idx]

            all_b.append(b)
            all_a.append(a)
            all_gj.append(gj)
            all_gi.append(gi)
            all_anch.append(anch[i][idx])
            from_which_layer.append(torch.ones(size=(len(b),)) * i)
            #-------------------------------------------#
            #   Predictions at the matched positions
            #-------------------------------------------#
            fg_pred = prediction[b, a, gj, gi]
            p_obj.append(fg_pred[:, 4:5])
            p_cls.append(fg_pred[:, 5:])
            #-------------------------------------------#
            #   Decode the boxes relative to their grid cells
            #-------------------------------------------#
            grid = torch.stack([gi, gj], dim=1).type_as(fg_pred)
            pxy = (fg_pred[:, :2].sigmoid() * 2. - 0.5 + grid) * self.stride[i]
            pwh = (fg_pred[:, 2:4].sigmoid() * 2) ** 2 * anch[i][idx] * self.stride[i]
            pxywh = torch.cat([pxy, pwh], dim=-1)
            pxyxy = self.xywh2xyxy(pxywh)
            pxyxys.append(pxyxy)
        #-------------------------------------------#
        #   Skip the image if there is no candidate prediction
        #-------------------------------------------#
        pxyxys = torch.cat(pxyxys, dim=0)
        if pxyxys.shape[0] == 0:
            continue
        #-------------------------------------------#
        #   Concatenate the per-layer lists
        #-------------------------------------------#
        p_obj = torch.cat(p_obj, dim=0)
        p_cls = torch.cat(p_cls, dim=0)
        from_which_layer = torch.cat(from_which_layer, dim=0)
        all_b = torch.cat(all_b, dim=0)
        all_a = torch.cat(all_a, dim=0)
        all_gj = torch.cat(all_gj, dim=0)
        all_gi = torch.cat(all_gi, dim=0)
        all_anch = torch.cat(all_anch, dim=0)
        #-------------------------------------------------------------#
        #   Overlap between ground-truth and predicted boxes
        #   IoU lies in 0-1, so -log(IoU) lies in 0~inf
        #   The larger the overlap, the smaller -log(IoU),
        #   hence the smaller pair_wise_iou_loss
        #-------------------------------------------------------------#
        pair_wise_iou = self.box_iou(txyxy, pxyxys)
        pair_wise_iou_loss = -torch.log(pair_wise_iou + 1e-8)
        #-------------------------------------------#
        #   Take at most the 20 largest IoUs per ground-truth box
        #   and sum them to get how many predictions it receives
        #-------------------------------------------#
        top_k, _ = torch.topk(pair_wise_iou, min(20, pair_wise_iou.shape[1]), dim=1)
        dynamic_ks = torch.clamp(top_k.sum(1).int(), min=1)
        #-------------------------------------------#
        #   gt_cls_per_image: one-hot ground-truth classes
        #-------------------------------------------#
        gt_cls_per_image = F.one_hot(this_target[:, 1].to(torch.int64), self.num_classes).float().unsqueeze(1).repeat(1, pxyxys.shape[0], 1)
        #-------------------------------------------#
        #   cls_preds_: predicted class confidence
        #   The closer cls_preds_ is to 1, the closer y is to 1
        #   and the larger y / (1 - y) becomes;
        #   the more accurate the class confidence,
        #   the smaller pair_wise_cls_loss
        #-------------------------------------------#
        num_gt = this_target.shape[0]
        cls_preds_ = p_cls.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_() * p_obj.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
        y = cls_preds_.sqrt_()
        pair_wise_cls_loss = F.binary_cross_entropy_with_logits(torch.log(y / (1 - y)), gt_cls_per_image, reduction="none").sum(-1)
        del cls_preds_
        #-------------------------------------------#
        #   Total cost
        #-------------------------------------------#
        cost = (
            pair_wise_cls_loss
            + 3.0 * pair_wise_iou_loss
        )
        #-------------------------------------------#
        #   Select the k lowest-cost predictions
        #-------------------------------------------#
        matching_matrix = torch.zeros_like(cost)
        for gt_idx in range(num_gt):
            _, pos_idx = torch.topk(cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False)
            matching_matrix[gt_idx][pos_idx] = 1.0
        del top_k, dynamic_ks
        #-------------------------------------------#
        #   If a prediction is matched to several
        #   ground-truth boxes, keep only the one
        #   with the lowest cost
        #-------------------------------------------#
        anchor_matching_gt = matching_matrix.sum(0)
        if (anchor_matching_gt > 1).sum() > 0:
            _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
            matching_matrix[:, anchor_matching_gt > 1] *= 0.0
            matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0
        fg_mask_inboxes = matching_matrix.sum(0) > 0.0
        matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)
        #-------------------------------------------#
        #   Keep the selected predictions
        #-------------------------------------------#
        from_which_layer = from_which_layer[fg_mask_inboxes]
        all_b = all_b[fg_mask_inboxes]
        all_a = all_a[fg_mask_inboxes]
        all_gj = all_gj[fg_mask_inboxes]
        all_gi = all_gi[fg_mask_inboxes]
        all_anch = all_anch[fg_mask_inboxes]
        this_target = this_target[matched_gt_inds]

        for i in range(num_layer):
            layer_idx = from_which_layer == i
            matching_bs[i].append(all_b[layer_idx])
            matching_as[i].append(all_a[layer_idx])
            matching_gjs[i].append(all_gj[layer_idx])
            matching_gis[i].append(all_gi[layer_idx])
            matching_targets[i].append(this_target[layer_idx])
            matching_anchs[i].append(all_anch[layer_idx])

    for i in range(num_layer):
        matching_bs[i] = torch.cat(matching_bs[i], dim=0) if len(matching_bs[i]) != 0 else torch.Tensor(matching_bs[i])
        matching_as[i] = torch.cat(matching_as[i], dim=0) if len(matching_as[i]) != 0 else torch.Tensor(matching_as[i])
        matching_gjs[i] = torch.cat(matching_gjs[i], dim=0) if len(matching_gjs[i]) != 0 else torch.Tensor(matching_gjs[i])
        matching_gis[i] = torch.cat(matching_gis[i], dim=0) if len(matching_gis[i]) != 0 else torch.Tensor(matching_gis[i])
        matching_targets[i] = torch.cat(matching_targets[i], dim=0) if len(matching_targets[i]) != 0 else torch.Tensor(matching_targets[i])
        matching_anchs[i] = torch.cat(matching_anchs[i], dim=0) if len(matching_anchs[i]) != 0 else torch.Tensor(matching_anchs[i])

    return matching_bs, matching_as, matching_gjs, matching_gis, matching_targets, matching_anchs
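The dynamic-k selection and the conflict rule inside build_targets can be tried on a toy matrix without PyTorch. This is a simplified sketch: `simota_assign` is a hypothetical helper, and the toy cost is just -log(IoU) with no classification term.

```python
import math


def simota_assign(ious, costs, max_candidates=20):
    """Assign predictions to ground-truth boxes.

    ious, costs: one row per ground-truth box, one column per prediction.
    Returns a dict mapping prediction index -> ground-truth index.
    """
    num_gt = len(ious)
    num_pred = len(ious[0]) if num_gt else 0
    picked = []
    for g in range(num_gt):
        # Dynamic k: truncated sum of the largest IoUs, at least 1.
        top_ious = sorted(ious[g], reverse=True)[:max_candidates]
        k = max(int(sum(top_ious)), 1)
        by_cost = sorted(range(num_pred), key=lambda p: costs[g][p])
        picked.append(set(by_cost[:k]))
    assignment = {}
    for p in range(num_pred):
        owners = [g for g in range(num_gt) if p in picked[g]]
        if len(owners) == 1:
            assignment[p] = owners[0]
        elif len(owners) > 1:
            # Conflict: the prediction goes to the lowest-cost ground truth.
            assignment[p] = min(range(num_gt), key=lambda g: costs[g][p])
    return assignment


ious = [[0.9, 0.8, 0.7, 0.1],
        [0.1, 0.9, 0.8, 0.7]]
costs = [[-math.log(v) for v in row] for row in ious]
assignment = simota_assign(ious, costs)
print(assignment)
```

Both boxes get k = 2 here (each IoU sum is 2.5, truncated to 2). Prediction 1 is picked by both boxes and ends up with box 1, whose cost for it is lower; prediction 3 is picked by neither and stays a negative sample.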
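3. Computing the loss
- Reg
- The anchors matched to each ground-truth box are known
- After obtaining the anchors matched to a ground-truth box, take the predicted boxes of those anchors
- Compute the CIoU loss between the ground-truth box and the predicted box
- This forms the Reg part of the loss
- Obj: cross entropy between the positive/negative labels and the objectness prediction
- The anchors matched to each ground-truth box are known
- All anchors matched to a ground-truth box are positive samples; the rest are negative samples
- Compute the cross-entropy loss between the positive/negative labels and each feature point's objectness prediction
- This forms the Obj part of the loss
- Cls: cross entropy between the ground-truth class and the anchor's class prediction
- The anchors matched to each ground-truth box are known
- After obtaining the anchors matched to a ground-truth box, take their class predictions
- Compute the cross-entropy loss between the ground-truth class and the anchor's class prediction
- This forms the Cls part of the loss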
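The three terms can be sketched in plain Python for one positive sample. This is a rough illustration rather than the repository's implementation: it assumes the standard CIoU definition and binary cross entropy, uses hard 1/0 objectness targets as the list above describes, leaves out any per-term weighting, and `ciou`, `bce` and `loss_terms` are hypothetical helper names.

```python
import math


def ciou(box_a, box_b, eps=1e-7):
    """Complete IoU between two boxes given as (cx, cy, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ax1, ay1, ax2, ay2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    bx1, by1, bx2, by2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    iou = inter / (aw * ah + bw * bh - inter + eps)
    # Squared centre distance over the squared diagonal of the enclosing box.
    diag2 = ((max(ax2, bx2) - min(ax1, bx1)) ** 2
             + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps)
    centre2 = (ax - bx) ** 2 + (ay - by) ** 2
    # Aspect-ratio consistency term.
    v = 4 / math.pi ** 2 * (math.atan(bw / bh) - math.atan(aw / ah)) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - centre2 / diag2 - alpha * v


def bce(prob, target, eps=1e-12):
    """Binary cross entropy of a probability against a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def loss_terms(pred_box, gt_box, obj_probs, is_positive, cls_probs, gt_class):
    """Return the (reg, obj, cls) loss terms, unweighted."""
    reg = 1.0 - ciou(pred_box, gt_box)
    obj = sum(bce(p, 1.0 if pos else 0.0)
              for p, pos in zip(obj_probs, is_positive)) / len(obj_probs)
    cls = sum(bce(p, 1.0 if c == gt_class else 0.0)
              for c, p in enumerate(cls_probs)) / len(cls_probs)
    return reg, obj, cls


reg, obj, cls = loss_terms(
    pred_box=(60, 50, 20, 40), gt_box=(50, 50, 20, 40),
    obj_probs=[0.9, 0.2, 0.1], is_positive=[True, False, False],
    cls_probs=[0.8, 0.1], gt_class=0)
print(reg, obj, cls)
```

The Reg and Cls terms only involve positive samples, while the Obj term runs over every feature point, which is why it is the only one that sees negative samples.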
Loss code (positive samples come from the build_targets routine listed in the SimOTA section)
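Training setup: training your own YOLOv7
- After opening the project, the root directory is the directory where the files are stored
Once the dataset is in place, it needs one more processing step to produce the 2007_train.txt and 2007_val.txt files used for training; this is done with voc_annotation.py in the root directory.
voc_annotation.py has a few parameters to set.
They are annotation_mode, classes_path, trainval_percent, train_percent and VOCdevkit_path. For a first training run, changing only classes_path is enough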
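'''
annotation_mode selects what this script computes when it runs
annotation_mode = 0 runs the whole label-processing pipeline: it generates the txt files under VOCdevkit/VOC2007/ImageSets as well as the 2007_train.txt and 2007_val.txt used for training
annotation_mode = 1 only generates the txt files under VOCdevkit/VOC2007/ImageSets
annotation_mode = 2 only generates the 2007_train.txt and 2007_val.txt used for training
'''
annotation_mode = 0
'''
Must be changed; used to generate the object information in 2007_train.txt and 2007_val.txt
It only has to match the classes_path used for training and prediction
If the generated 2007_train.txt contains no object information,
the classes were not set correctly
Only takes effect when annotation_mode is 0 or 2
'''
classes_path = 'model_data/voc_classes.txt'
'''
trainval_percent sets the ratio of (training set + validation set) to test set; by default (training set + validation set) : test set = 9 : 1
train_percent sets the ratio of training set to validation set within (training set + validation set); by default training set : validation set = 9 : 1
Only takes effect when annotation_mode is 0 or 1
'''
trainval_percent = 0.9
train_percent = 0.9
'''
Points to the folder that contains the VOC dataset
By default it points to the VOC dataset in the root directory
'''
VOCdevkit_path = 'VOCdevkit'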
- The weights are written to the logs folder
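The effect of the two percentages is easy to work out by hand. The sketch below assumes the counts are truncated with int(); `split_counts` is a hypothetical helper, not a function of voc_annotation.py.

```python
def split_counts(num_images, trainval_percent=0.9, train_percent=0.9):
    """Number of images that land in each subset for the given percentages."""
    num_trainval = int(num_images * trainval_percent)
    num_train = int(num_trainval * train_percent)
    return {
        "train": num_train,
        "val": num_trainval - num_train,
        "test": num_images - num_trainval,
    }


counts = split_counts(1000)
print(counts)
```

With the defaults, 1000 annotated images give 810 for training, 90 for validation and 100 for testing, so the 9:1 ratio is applied twice.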
Prediction
- Files used
- yolo.py
- Edit model_path (the weight file) and classes_path (the path of the class file)
- predict.py
- yolo.py
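Notes
Training
- The data format must meet the requirements
- Images need not have a fixed size; input images are resized
- Grayscale images are converted to RGB automatically for training
- Images whose extension is not jpg must be batch-converted
- Labels are in .xml format
- The loss value is used to judge whether training has converged
What matters is a converging trend, i.e. the validation loss keeps going down
If the validation loss has basically stopped changing, the model has basically converged
The loss does not have to be as close to 0 as possible (converging is enough): to make the numbers look nicer, the loss function can simply be divided by 10000 (neat trick!)
- The loss values from training are saved in the loss_%Y_%m_%d_%H_%M_%S folder under the logs folder
- The trained weight files are saved in the logs folder
Each training epoch consists of a number of steps, and each step performs one gradient-descent update
If training only ran for a few steps, nothing is saved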
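The relation between epochs and steps can be written down directly. This sketch assumes that an incomplete final batch is dropped, so the step count is an integer division; `steps_per_epoch` is a hypothetical helper.

```python
def steps_per_epoch(num_train_images, batch_size):
    """Number of gradient-descent updates in one epoch."""
    steps = num_train_images // batch_size
    if steps == 0:
        raise ValueError("dataset too small for this batch size")
    return steps


steps = steps_per_epoch(810, 8)
total_updates = steps * 300   # e.g. 300 epochs
print(steps, total_updates)
```

A run that stops before the first epoch finishes has completed fewer than `steps` updates, which matches the note that a handful of steps leaves nothing saved.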
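Pretrained weights
- They are shared across datasets, because the features are generic
- The important part is the weights of the backbone feature-extraction network, which performs the feature extraction
- They are needed in 99% of cases: without them the backbone weights are too random, feature extraction is weak and training results are poor
- If training is interrupted, set model_path to a weight file under logs and load it to resume training
Also adjust the freeze-stage and unfreeze-stage parameters so that the epoch count stays continuous. When model_path = "", the weights of the whole model are not loaded
- To train the model from scratch, set model_path = "" and, further down, Freeze_Train = False. Training then starts from scratch with no backbone-freezing phase
- Training a network from scratch usually gives poor results (the weights are too random) and weak feature extraction
- Two options for training from scratch:
- With Mosaic augmentation, a large UnFreeze_Epoch (300 or more), a large batch (16 or more) and plenty of data, set Mosaic=True and train directly from initialized parameters; the result is still worse than with pretraining
- Get familiar with the ImageNet dataset and first train a classification model to obtain the backbone weights; the backbone of the classification model is shared with this model, so training can continue from it.
- With model_path set, the backbone weights do not need to be loaded separately
- Without model_path and with pretrained = True, only the backbone is loaded before training
- Without model_path and with pretrained = False and Freeze_Train = False, training starts from scratch with no backbone-freezing phase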
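Freeze stage
- Purpose: for machines with limited hardware
- Freeze training needs less GPU memory. On a weak GPU,
set Freeze_Epoch = UnFreeze_Epoch and Freeze_Train = True
so that only freeze training is performed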
- Parameter suggestions:
- Starting from the pretrained weights of the whole model
- SGD:
Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 300, Freeze_Train = True, optimizer_type = 'sgd', Init_lr = 1e-2, weight_decay = 5e-4 (with freezing)
Init_Epoch = 0, UnFreeze_Epoch = 300, Freeze_Train = False, optimizer_type = 'sgd', Init_lr = 1e-2, weight_decay = 5e-4 (without freezing)
- Training from scratch
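How the epoch parameters split a run into phases can be sketched as follows. This assumes that epochs before Freeze_Epoch train with the backbone frozen and the remaining epochs up to UnFreeze_Epoch train the whole network; `training_phases` is a hypothetical helper, not part of the training script.

```python
def training_phases(init_epoch, freeze_epoch, unfreeze_epoch, freeze_train):
    """Return (start, end, backbone_frozen) tuples; `end` is exclusive."""
    if not freeze_train:
        return [(init_epoch, unfreeze_epoch, False)]
    phases = []
    if init_epoch < freeze_epoch:
        phases.append((init_epoch, min(freeze_epoch, unfreeze_epoch), True))
    if freeze_epoch < unfreeze_epoch:
        phases.append((max(init_epoch, freeze_epoch), unfreeze_epoch, False))
    return phases


frozen_then_unfrozen = training_phases(0, 50, 300, True)
freeze_only = training_phases(0, 300, 300, True)
resumed = training_phases(60, 50, 300, True)
print(frozen_then_unfrozen, freeze_only, resumed)
```

The second call corresponds to the Freeze_Epoch = UnFreeze_Epoch setting for weak GPUs, and the third to resuming an interrupted run at epoch 60 by raising Init_Epoch.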
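batch_size settings
- As large as the GPU can handle
- Running out of GPU memory has nothing to do with the size of the dataset
- On "OOM" or "CUDA out of memory", reduce batch_size
- Because of the BatchNorm layers, the minimum batch_size is 2; it cannot be 1
- Normally Freeze_batch_size should be 1 to 2 times Unfreeze_batch_size. Too large a gap interferes with the automatic learning-rate adjustment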
Unfreeze stage
Input image
- The size must be a multiple of 32
- YoloV7 versions: l corresponds to v7, x corresponds to v7x
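The multiple-of-32 rule for the input size ties in with the three feature layers used earlier. The sketch below assumes strides of 8, 16 and 32, which is what a 640 input and the 80x80, 40x40 and 20x20 layers imply; both helpers are hypothetical.

```python
def round_up_to_multiple(size, base=32):
    """Smallest multiple of `base` that is >= size."""
    return ((size + base - 1) // base) * base


def feature_map_sizes(input_size, strides=(8, 16, 32)):
    """Side length of each feature layer for a square input."""
    if input_size % max(strides) != 0:
        raise ValueError("input size must be a multiple of %d" % max(strides))
    return [input_size // s for s in strides]


sizes = feature_map_sizes(640)
padded = round_up_to_multiple(600)
print(sizes, padded)
```

An input of 600 would not divide evenly at the coarsest layer, so it would have to be raised to 608 first.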