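Note: the input is processed differently in the demo and in training. This post covers the forward inference pass of the demo.
I. Input image processing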
# detect.py - 108
dataset = LoadImages(source, img_size=imgsz, stride=stride, auto=pt, vid_stride=vid_stride)
This class does most of the work. The key calls inside it are:
# dataloaders.py - 311
im = letterbox(im0, self.img_size, stride=self.stride, auto=self.auto)[0] # padded resize; this is the function that gets called
im = im.transpose((2, 0, 1))[::-1] # HWC to CHW, BGR to RGB
im = np.ascontiguousarray(im) # contiguous
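The letterbox function processes the image as follows:
1. new_shape = (640, 640) caps the height and width of the image fed to the network at 640.
2. Compute the ratios 640 / h and 640 / w for the original image and take the smaller one. Note that the minimum is taken over 640 / (original h, w), so the longer side of the original image sets the scale, which guarantees that no image content is lost.
3. Shrink or enlarge the image by that ratio, i.e. multiply the original h and w by it.
4. The h and w of the image fed to the model must be integer multiples of stride, so the image is padded as follows: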
dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1] # wh padding
if auto: # minimum rectangle
    dw, dh = np.mod(dw, stride), np.mod(dh, stride) # wh padding; np.mod is the modulo operation
This gives the total padding; dividing by 2 gives the padding for each side:
dw /= 2 # divide padding into 2 sides
dh /= 2
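5. Pad the image, which yields the image fed to the model.
Two examples: after processing, the upper image has padding on its top and bottom edges, while the lower one needs no padding because its sides are already integer multiples of stride after resizing.
In summary, with auto enabled the input image is not a fixed 640x640: the longer side is 640 and the other side only has to be an integer multiple of stride.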
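The five steps above can be condensed into a few lines of shape arithmetic. This is a simplified sketch, not the YOLOv5 implementation: letterbox_shape is a hypothetical helper, it assumes auto=True and stride=32, and it only computes sizes without touching pixel data.

```python
def letterbox_shape(h, w, new_shape=640, stride=32):
    """Return ((resized_h, resized_w), (padded_h, padded_w)) for the auto=True path."""
    # Step 2: the smaller ratio, so the longer side ends up at new_shape
    r = min(new_shape / h, new_shape / w)
    # Step 3: resize while keeping the aspect ratio
    new_h, new_w = int(round(h * r)), int(round(w * r))
    # Step 4: total padding, reduced modulo stride (minimum rectangle)
    dh = (new_shape - new_h) % stride
    dw = (new_shape - new_w) % stride
    # Step 5: padded size is the resized size plus the total padding
    return (new_h, new_w), (new_h + dh, new_w + dw)


if __name__ == "__main__":
    print(letterbox_shape(720, 1280))
    print(letterbox_shape(1080, 810))
```

Under these assumptions a 720x1280 frame is resized to 360x640 and padded to 384x640, while a 1080x810 image is resized to 640x480 and needs no padding.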
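II. Forward pass (demo)
Figure 1
1. Reading the figure above against the code and Section III (Network structure), to better understand the whole network
The figure is borrowed from the web, with my own annotations added (Figure 1). The numbers correspond to the module indices in model; see Section III (Network structure) below. (Note: comparing the two, my impression is that 8 and 9 are swapped in the figure.)
The forward pass is mainly carried out by the function below.
# yolo.py ---114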
def _forward_once(self, x, profile=False, visualize=False):
    y, dt = [], []  # outputs
    # print('================')
    for m in self.model:  # m iterates over the modules in model
        # print('====================')
        # print("i is {}".format(m.i))
        # print('====================')
        if m.f != -1:  # if not from previous layer
            # print("when i is {},f is {}".format(m.i, m.f))
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        x = m(x)  # run
        y.append(x if m.i in self.save else None)  # save output; note the m.i attribute here
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
    return x
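The important variable is m, which takes the modules out of model one by one. During my experiments I printed every case where m.f is not -1 (m.i is the module's index). The output was: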
when i is 12,f is [-1, 6]
when i is 16,f is [-1, 4]
when i is 19,f is [-1, 14]
when i is 22,f is [-1, 10]
when i is 24,f is [17, 20, 23]
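The if condition fires at i = 12, 16, 19, 22 and 24. Comparing with Figure 1 and Section III below, these are exactly the indices of the Concat modules (except the last one, which is the Detect module, also matching Figure 1). In other words, a Concat layer needs earlier outputs in order to concatenate tensors (torch.cat). Note the values in f: -1 means the output of the previous module, and the other entry is the index of an earlier module whose output is reused. This is where y in the code comes into play; debugging shows what self.save contains.
It matches exactly the indices required by f, so every output that a Concat layer needs is retained. That is the role of y: it keeps the outputs at the positions the Concat modules need, and every other entry of y is None (the else None in the code).
Combining Figure 1 with the 'when i is, f is' output, in summary:
1. Concat module ⑫ concatenates ⑪ and ⑥
2. Concat module ⑯ concatenates ⑮ and ④
3. Concat module ⑲ concatenates ⑱ and ⑭
4. Concat module ㉒ concatenates ㉑ and ⑩
5. Detect head ㉔ receives ⑰, ⑳ and ㉓, so the final output covers three scales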
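The routing summarized above can be replayed without any tensors. The sketch below is a toy stand-in for _forward_once, not the real code: froms, save and forward_once are hypothetical names, each layer's output is just a string label, and save is assumed to be the sorted set of indices in f other than -1, taken from the printed output.

```python
# f values for the layers printed above; every other layer is assumed to have f == -1
froms = {12: [-1, 6], 16: [-1, 4], 19: [-1, 14], 22: [-1, 10], 24: [17, 20, 23]}

# Assumption: save holds every index that appears in f, except -1
save = sorted({j for f in froms.values() for j in f if j != -1})


def forward_once(n_layers=25):
    """Replay the y / save bookkeeping with string labels instead of tensors."""
    y, x = [], "input"
    routed = {}  # layer index -> list of inputs it received
    for i in range(n_layers):
        f = froms.get(i, -1)
        if f != -1:  # not fed only by the previous layer
            x = [x if j == -1 else y[j] for j in f]
            routed[i] = list(x)
        x = f"out{i}"  # stands in for x = m(x)
        y.append(x if i in save else None)  # keep only what later layers need
    return y, routed


if __name__ == "__main__":
    y, routed = forward_once()
    print(save)
    print(routed)
```

Only 7 of the 25 entries in y are kept in this replay; the rest are None, which is how the real loop avoids holding every intermediate feature map in memory.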
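2. Definitions and functions of the modules in the code
The modules are all defined in common.py.
Note: because these modules all inherit from the nn.Module base class, calling one of them runs its forward function. The modules themselves are created when the model is loaded, shown below: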
# detect.py ---85
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
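which jumps to:
# common.py ---339
if pt: # PyTorch
    model = attempt_load(weights if isinstance(weights, list) else w, device=device, inplace=True, fuse=fuse) # the network is built from the weights file
Loading the model already creates the model instance, so every module is initialized at this point: the attributes set on self in __init__ have been created from the parameters stored in the weights file (for example the input and output channels c1 and c2, shortcut, and so on).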
(1) Conv module
class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.SiLU()  # default activation
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
    def forward(self, x):  # used when the bn layer is present
        return self.act(self.bn(self.conv(x)))
    def forward_fuse(self, x):  # used when there is no bn layer
        return self.act(self.conv(x))
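This is the basic building block. As Section III shows, the loaded model contains no bn layers (fuse is passed to attempt_load above, which folds each bn into its convolution), so forward_fuse runs: convolution followed directly by the activation function. The SiLU activation is SiLU(x) = x * sigmoid(x) = x / (1 + e^(-x)).
Its derivative is SiLU'(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x))) (for details see: 23 activation functions).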
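As a quick check of the SiLU definition and its derivative, here is a minimal scalar sketch using only the standard library. The function names sigmoid, silu and silu_grad are illustrative and are not taken from YOLOv5 or PyTorch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x * sigmoid(x)


def silu_grad(x):
    # d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x)) = s(x) * (1 + x * (1 - s(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


if __name__ == "__main__":
    x, h = 1.3, 1e-6
    numeric = (silu(x + h) - silu(x - h)) / (2 * h)  # central finite difference
    print(silu_grad(x), numeric)
```

The analytic derivative and the finite-difference estimate should agree to several decimal places, which is an easy way to confirm the formula.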
(2) C3 module
class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))
    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))
class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2
    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
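Bottleneck and C3 are shown together here. C3 first runs self.cv1(x) and self.cv2(x), both of which are Conv modules. The cv1 branch then passes through the Bottleneck modules, each of which again contains Conv modules: two Conv modules run in sequence and their output is added to the input (compare this module in Figure 1). The addition is skipped when shortcut is False or when the input and output channel counts differ. Finally the two branches are concatenated along the channel dimension and passed through cv3.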
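The channel bookkeeping in C3.__init__ can be checked against the printout in Section III. The helper below is a hypothetical sketch (c3_channels does not exist in YOLOv5) that only reproduces the (in, out) channel pairs implied by the constructor, with e=0.5 assumed as the default shown above.

```python
def c3_channels(c1, c2, e=0.5):
    """Return the (in, out) channels of each Conv inside a C3 block."""
    c_ = int(c2 * e)  # hidden channels, as in C3.__init__
    return {
        "cv1": (c1, c_),         # branch that feeds the Bottleneck stack
        "cv2": (c1, c_),         # parallel 1x1 branch
        "bottleneck": (c_, c_),  # Bottleneck(c_, c_, ..., e=1.0) keeps the width
        "cv3": (2 * c_, c2),     # applied after concatenating both branches
    }


if __name__ == "__main__":
    print(c3_channels(64, 64))    # layer (2) in the printout
    print(c3_channels(512, 256))  # layer (13) in the printout
```

For layer (13), which follows a Concat with 512 channels, this gives cv1 and cv2 of 512 to 128 and cv3 of 256 to 256, matching the Conv2d shapes listed for that layer.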
(3) SPPF module
class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)
            return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))  # concatenate the four tensors, then convolve
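Again, compare this module in Figure 1. It first runs a Conv module, then applies max pooling three times in succession (the third one inside the return statement), keeps all three results, concatenates them with the output of that first Conv, and finally runs another Conv module.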
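The comment "equivalent to SPP(k=(5, 9, 13))" in the code can be illustrated in one dimension: chaining stride-1 max pools with k=5 covers the same window as a single pool with k=9, and then k=13. The sketch below is a plain-Python illustration with a hypothetical maxpool1d helper; clipping the window at the borders plays the role of the padding in nn.MaxPool2d.

```python
def maxpool1d(xs, k):
    """Stride-1 max pool with window k; windows are clipped at the borders."""
    p = k // 2
    return [max(xs[max(0, i - p): i + p + 1]) for i in range(len(xs))]


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

y1 = maxpool1d(data, 5)  # one k=5 pool
y2 = maxpool1d(y1, 5)    # two chained k=5 pools
y3 = maxpool1d(y2, 5)    # three chained k=5 pools

if __name__ == "__main__":
    print(y2 == maxpool1d(data, 9))
    print(y3 == maxpool1d(data, 13))
```

Reusing the previous pooling result is what makes the "fast" variant cheaper than running three independent pools with growing kernels.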
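(4) Upsample
This directly calls the upsampling class Upsample from the PyTorch library, whose forward is:
def forward(self, input: Tensor) -> Tensor:
    return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners,
                         recompute_scale_factor=self.recompute_scale_factor)
Upsampling enlarges the feature map by interpolation. In this model the mode is nearest with scale_factor=2.0 (see the Upsample entries in Section III), so each value is repeated in a 2x2 block rather than bilinearly interpolated.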
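To make the nearest mode with scale_factor=2.0 concrete, here is a small sketch on a nested list. upsample_nearest is a hypothetical helper for illustration; it only handles an integer scale and is not how F.interpolate is implemented.

```python
def upsample_nearest(grid, scale=2):
    """Nearest-neighbour upsampling of a 2D list by an integer scale factor."""
    out = []
    for row in grid:
        # Repeat every value `scale` times along the width...
        wide = [v for v in row for _ in range(scale)]
        # ...and repeat the widened row `scale` times along the height
        out.extend(list(wide) for _ in range(scale))
    return out


if __name__ == "__main__":
    for row in upsample_nearest([[1, 2], [3, 4]]):
        print(row)
```

A 2x2 input becomes 4x4, which is how a 20x20 feature map grows to 40x40 before being concatenated with the matching backbone output.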
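(5) Detect module
The detection head. It maps the feature maps from the different scales to a common 255 channels, i.e. na * (nc + 5) = 3 * 85.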
class Detect(nn.Module):
    # YOLOv5 Detect head for detection models
    stride = None  # strides computed during build
    dynamic = False  # force grid reconstruction
    export = False  # export mode
    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
        super().__init__()
        # self.anchors = anchors
        self.nc = nc  # number of classes, e.g. 80
        self.no = nc + 5  # number of outputs per anchor, e.g. 85
        self.nl = len(anchors)  # number of detection layers, e.g. 3; anchors holds the configured anchor-box parameters, shape (3,3,2): for each of the 3 layers, 3 anchors (w, h) at every feature-map location
        self.na = len(anchors[0]) // 2  # number of anchors, e.g. 3
        self.grid = [torch.empty(0) for _ in range(self.nl)]  # init grid
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)]  # init anchor grid
        self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2))  # shape(nl,na,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
        self.inplace = inplace  # use inplace ops (e.g. slice assignment)
    def forward(self, x):  # e.g. x {list:3} Tensor:(1,128,80,80), Tensor:(1,256,40,40), Tensor:(1,512,20,20)
        z = []  # inference output
        for i in range(self.nl):  # e.g. i:0; each scale is processed in turn
            x[i] = self.m[i](x[i])  # conv; e.g. x {list:3} Tensor:(1,255,80,80), Tensor:(1,256,40,40), Tensor:(1,512,20,20)
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85); e.g. bs:1, _:255, ny:80, nx:80
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()  # e.g. x {list:3} Tensor:(1,3,80,80,85), Tensor:(1,256,40,40), Tensor:(1,512,20,20)
            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:  # rebuild the grids when the input size changes
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)  # e.g. grid {list:3} Tensor:(1,3,80,80,2),Tensor:(1,3,42,28,2),Tensor:(1,3,21,14,2)
                    # anchor_grid {list:3} Tensor:(1,3,80,80,2),Tensor:(1,3,42,28,2),Tensor:(1,3,21,14,2)
                    # also handled one scale at a time: only the current entry is updated; the later entries still hold the values from the previous input
                    # grid holds the feature-map cell coordinates; anchor_grid holds the anchor sizes in input-image pixels
                if isinstance(self, Segment):  # (boxes + masks)
                    xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                    xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy: inverse of the anchor-based target encoding, recovers the predicted xy
                    wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh: same idea, the inverse operation recovers wh
                    y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)  # final prediction, for this one scale only
                else:  # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))  # collect every scale, reshaped to the output shape
        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)
    def _make_grid(self, nx=20, ny=20, i=0, torch_1_10=check_version(torch.__version__, '1.10.0')):
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2  # grid shape
        y, x = torch.arange(ny, device=d, dtype=t), torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing='ij') if torch_1_10 else torch.meshgrid(y, x)  # torch>=0.7 compatibility
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5  # add grid offset, i.e. y = 2.0 * x - 0.5
        anchor_grid = (self.anchors[i] * self.stride[i]).view((1, self.na, 1, 1, 2)).expand(shape)  # multiply by stride to map back to the input image
        return grid, anchor_grid
Pay attention to how the anchor grids are set up: they are rebuilt whenever the input size changes.
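The decoding in the boxes-only branch can be followed for a single prediction with scalars. The sketch below mirrors the xy and wh formulas from forward and the -0.5 offset from _make_grid; decode_box and num_predictions are hypothetical helpers, and the anchor size and cell index used in the example are made-up values for illustration only.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(raw, cell_xy, anchor_wh, stride):
    """raw = (tx, ty, tw, th) logits; cell_xy = integer grid cell; anchor_wh in input pixels."""
    tx, ty, tw, th = raw
    gx, gy = cell_xy[0] - 0.5, cell_xy[1] - 0.5  # grid offset, as in _make_grid
    x = (sigmoid(tx) * 2 + gx) * stride          # box centre in input pixels
    y = (sigmoid(ty) * 2 + gy) * stride
    w = (sigmoid(tw) * 2) ** 2 * anchor_wh[0]    # between 0 and 4 times the anchor
    h = (sigmoid(th) * 2) ** 2 * anchor_wh[1]
    return x, y, w, h


def num_predictions(feature_sizes, na=3):
    """Rows of torch.cat(z, 1): na boxes per cell, summed over all scales."""
    return sum(na * ny * nx for ny, nx in feature_sizes)


if __name__ == "__main__":
    # All-zero logits land in the middle of the cell with exactly the anchor size
    print(decode_box((0.0, 0.0, 0.0, 0.0), (10, 5), (30.0, 61.0), 8))
    print(num_predictions([(80, 80), (40, 40), (20, 20)]))
```

With all-zero logits the box centre falls on the middle of cell (10, 5) at stride 8, and the size equals the anchor. For a 640x640 input the three scales together produce 25200 candidate boxes, each with 85 values.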
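III. Network structure
This is the structure held in model (see yolo.py --- 116). In the loop
for m in self.model:
m takes out the layers listed here one at a time. As the printout shows, none of them contains a bn layer.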
Sequential(
(0): Conv(
(conv): Conv2d(3, 32, kernel_size=(6, 6), stride=(2, 2), padding=(2, 2))
(act): SiLU(inplace=True)
)
(1): Conv(
(conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(act): SiLU(inplace=True)
)
(2): C3(
(cv1): Conv(
(conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(32, 32, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(3): Conv(
(conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(act): SiLU(inplace=True)
)
(4): C3(
(cv1): Conv(
(conv): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
(1): Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(5): Conv(
(conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(act): SiLU(inplace=True)
)
(6): C3(
(cv1): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
(1): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
(2): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(7): Conv(
(conv): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(act): SiLU(inplace=True)
)
(8): C3(
(cv1): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(9): SPPF(
(cv1): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): MaxPool2d(kernel_size=5, stride=1, padding=2, dilation=1, ceil_mode=False)
)
(10): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(11): Upsample(scale_factor=2.0, mode=nearest)
(12): Concat()
(13): C3(
(cv1): Conv(
(conv): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(14): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(15): Upsample(scale_factor=2.0, mode=nearest)
(16): Concat()
(17): C3(
(cv1): Conv(
(conv): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(18): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(act): SiLU(inplace=True)
)
(19): Concat()
(20): C3(
(cv1): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(21): Conv(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(act): SiLU(inplace=True)
)
(22): Concat()
(23): C3(
(cv1): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv3): Conv(
(conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(m): Sequential(
(0): Bottleneck(
(cv1): Conv(
(conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(act): SiLU(inplace=True)
)
(cv2): Conv(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): SiLU(inplace=True)
)
)
)
)
(24): Detect(
(m): ModuleList(
(0): Conv2d(128, 255, kernel_size=(1, 1), stride=(1, 1))
(1): Conv2d(256, 255, kernel_size=(1, 1), stride=(1, 1))
(2): Conv2d(512, 255, kernel_size=(1, 1), stride=(1, 1))
)
)
)
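The printout lists Conv2d layers with no BatchNorm2d next to them, which is consistent with each bn having been folded into its convolution at load time. The arithmetic of such a fold can be sketched for a single channel with scalars. This is an illustrative simplification under that assumption, not the YOLOv5 fusing code; fuse_conv_bn and conv_then_bn are hypothetical names, and eps=1e-5 is assumed.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused path: scalar 'convolution' followed by batch norm in eval mode."""
    return gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta


def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm statistics into an equivalent weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


if __name__ == "__main__":
    # b = 0.0 because Conv builds its Conv2d with bias=False before fusing
    params = dict(w=0.7, b=0.0, gamma=1.2, beta=-0.3, mean=0.1, var=0.5)
    wf, bf = fuse_conv_bn(**params)
    x = 2.5
    print(wf * x + bf, conv_then_bn(x, **params))
```

Both paths give the same value, which is why the fused model can run forward_fuse (convolution plus activation) without changing its outputs.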