YOLOv5改进(一) 本文(7万字) | 添加注意力机制 | SE | CBAM | ECA | CA | SimAM | S2-MLPv2 | NAMAttention | 等 | 共计二十种 |





代码函数调用关系图(全网最详尽-重要)

因文档特殊,不能在博客正确显示,请移步以下链接!

图解YOLOv5_v7.0代码结构与调用关系(点击进入可以放大缩小等操作)



文章目录

  • 注意力机制
  • 注意力机制的分类
    • 1. SE 注意力模块
      • 1.1 原理
      • 1.2 代码
    • 2. CBAM 注意力模块
      • 2.1 原理
      • 2.2 代码
    • 3. ECA 注意力模块
      • 3.1 原理
      • 3.2 代码
    • 4. CA 注意力模块
      • 4.1 原理
      • 4.2 代码
    • 5. 添加方式
    • 6. 添加方式的补充说明
    • 7. SimAM 注意力模块
      • 7.1 原理
      • 7.2 代码
    • 8. S2-MLPv2 注意力模块
      • 8.1 原理
      • 8.2 代码
    • 9. NAMAttention 注意力模块
      • 9.1 原理
      • 9.2 代码
    • 10. Criss-CrossAttention 注意力模块
      • 10.1 原理
      • 10.2 代码
    • 12. Selective Kernel Attention 注意力模块
      • 12.1 原理
      • 12.2 代码
    • 11. GAMAttention 注意力模块
      • 11.1 原理
      • 11.2 代码
    • 14. A2-Net 注意力模块
      • 14.1 原理
      • 14.2 代码
  • 在C3模块中加入注意力机制
    • 1.第一版本添加方式介绍
      • 1.1 C3SE
      • 1.2 C3CA
      • 1.3 C3CBAM
      • 1.4 C3ECA
    • 2.第二版本添加方式介绍
      • 2.1 C3_SE_Attention
      • 2.2 C3_ECA_Attention
      • 2.3 C3_CBAM_Attention
      • 2.4 C3_CoordAtt_Attention
  • 更多注意力机制及代码
    • 前言
    • 注意力代码
    • yolo.py
    • 注意力配置文件写法表
      • 写法表如何使用?
    • YOLOv5模板
      • yolov5-template-Backbone
      • yolov5-template-Neck
      • yolov5-template-SPP


获取添加完成的注意力机制代码:

https://pan.baidu.com/s/1UMnCAhmyBkNc9ZV2ByWZLg?pwd=gpbk


注意力机制

注意力机制(Attention Mechanism)源于对人类视觉的研究。在认知科学中,由于信息处理的瓶颈,人类会选择性地关注所有信息的一部分,同时忽略其他可见的信息;为了合理利用有限的视觉信息处理资源,人类需要选择视觉区域中的特定部分并集中关注它。例如,人们在阅读时,通常只有少量要被读取的词会被关注和处理。综上,注意力机制主要解决两个问题:决定需要关注输入的哪部分;把有限的信息处理资源分配给重要的部分。

近几年有关 attention 的论文与日俱增,下图统计了 CVPR、ICCV、ECCV、NeurIPS、ICML 和 ICLR 等顶级会议中与 attention 相关论文数量的增长情况。下面我将分享 YOLOv5 v6.1 如何添加注意力机制,并介绍截至 2022 年 4 月、顶会上提出的 30 个优秀的 attention 模块。

在这里插入图片描述

可视化图表显示了顶级会议中与注意力相关的论文数量的增加量,包括CVPR,ICCV,ECCV,NeurIPS,ICML和ICLR。


注意力机制的分类

在这里插入图片描述

注意力机制分类图


1. SE 注意力模块

论文名称:《Squeeze-and-Excitation Networks》

论文地址:https://arxiv.org/pdf/1709.01507.pdf

代码地址: https://github.com/hujie-frank/SENet

1.1 原理

SEnet(Squeeze-and-Excitation Network)考虑了特征通道之间的关系,在特征通道上加入了注意力机制。

SEnet通过学习的方式自动获取每个特征通道的重要程度,并且利用得到的重要程度来增强有用特征、抑制对当前任务不重要的特征。SEnet 通过 Squeeze 模块和 Excitation 模块实现上述功能。

在这里插入图片描述

如图所示,作者首先通过 squeeze 操作对空间维度进行压缩,直白地说就是对每个特征图做全局平均池化,得到一个实数。该实数在某种程度上具有全局感受野,作者提到该操作能够让靠近输入的特征层也获得全局感受野,这在很多任务中非常有用。紧接着是 excitation 操作:经过 squeeze 后网络输出 1×1×C 大小的特征图,作者利用权重 W 来学习 C 个通道之间的相关性,实际实现中有的框架使用全连接层,有的框架使用 1×1 卷积。该过程中作者先对 C 个通道降维,再扩展回 C 个通道,好处是一方面降低了网络计算量,另一方面增加了网络的非线性能力。最后一步将 excitation 的输出看作经过特征选择后每个通道的重要性,通过乘法逐通道加权到先前的特征上,从而实现增强重要特征、抑制不重要特征的功能。

1.2 代码

# SE
import torch.nn as nn

class SE(nn.Module):
    def __init__(self, c1, c2, ratio=16):
        super(SE, self).__init__()
        # 创建一个 Squeeze-and-Excitation (SE) 模块
        # c1: 输入通道数,c2: 输出通道数,ratio: SE 模块中的通道缩放比例
        # 通过自适应平均池化将输入特征图的空间维度减小到 1x1
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # 第一个全连接层,用于降低通道数
        self.l1 = nn.Linear(c1, c1 // ratio, bias=False)
        # 非线性激活函数 ReLU
        self.relu = nn.ReLU(inplace=True)
        # 第二个全连接层,用于将通道数恢复到原始输入通道数
        self.l2 = nn.Linear(c1 // ratio, c1, bias=False)
        # Sigmoid 激活函数,将输出值缩放到 [0, 1] 范围内
        self.sig = nn.Sigmoid()
    
    def forward(self, x):
        # x: 输入的特征图
        b, c, _, _ = x.size()  # 获取输入特征图的形状信息:(批量大小, 通道数, 高度, 宽度)
        # 通过平均池化操作将特征图降维到 (b, c) 形状
        y = self.avgpool(x).view(b, c)       
        # 使用第一个全连接层进行通道缩放
        y = self.l1(y)      
        # 非线性激活函数 ReLU
        y = self.relu(y)        
        # 使用第二个全连接层进行通道恢复
        y = self.l2(y)     
        # 使用 Sigmoid 激活函数将输出缩放到 [0, 1] 范围内
        y = self.sig(y)        
        # 将输出特征图的形状还原为 (b, c, 1, 1),以便与输入特征图相乘
        y = y.view(b, c, 1, 1)      
        # 将原始输入特征图与缩放后的特征图相乘,以获得加权特征图
        return x * y.expand_as(x)
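
在正式接入 YOLOv5 之前,可以先单独做一个最小的形状自检(示意代码,输入尺寸为假设值):注意力模块不会改变特征图的形状,只是对通道做加权。

import torch

x = torch.randn(1, 64, 32, 32)
se = SE(64, 64)     # c2 在此实现中并未实际使用,仅为配合 yolo.py 的读取方式而保留
print(se(x).shape)  # torch.Size([1, 64, 32, 32])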

这里放上我自己做实验的截图,我就是把SE层加到了第 9 层的位置;粉红色线条代表添加了SE注意力机制。

在这里插入图片描述


2. CBAM 注意力模块

论文题目:《CBAM: Convolutional Block Attention Module》

论文地址:https://arxiv.org/pdf/1807.06521.pdf

2.1 原理

CBAM(Convolutional Block Attention Module)结合了特征通道和特征空间两个维度的注意力机制。

在这里插入图片描述

CBAM通过学习的方式自动获取每个特征通道的重要程度,和SEnet类似。此外还通过类似的学习方式自动获取每个特征空间的重要程度。并且利用得到的重要程度来提升特征并抑制对当前任务不重要的特征。

在这里插入图片描述

CBAM提取通道注意力的方式和SEnet基本类似,如下面 ChannelAttention 的代码所示,它在SEnet的基础上增加了 max_pool 这一路特征提取,其余步骤是一样的。经过通道注意力加权后的特征会作为空间注意力模块的输入。

在这里插入图片描述

CBAM提取空间注意力的方式:经过 ChannelAttention 后,将经过通道重要性选择后的特征图送入空间注意力模块。和通道注意力模块类似,空间注意力以通道为单位进行最大池化和平均池化,并将两者的结果进行 concat,之后再用一个卷积压缩成 1×w×h 的空间注意力权重图,再将该权重与输入特征逐元素相乘,从而实现空间注意力机制。

2.2 代码

# CBAM
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ratio=16):
        super(ChannelAttention, self).__init__()
        # 通道注意力模块,用于增强通道间的特征关系
        # in_planes: 输入特征图的通道数,ratio: 通道压缩比例
        # 自适应平均池化和自适应最大池化,用于捕获全局通道信息
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        # 第一个卷积层,用于通道压缩
        self.f1 = nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False)
        self.relu = nn.ReLU()

        # 第二个卷积层,用于通道恢复
        self.f2 = nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False)

        # Sigmoid 激活函数,将通道注意力权重缩放到 [0, 1] 范围内
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # 平均池化和最大池化后,通过两个卷积层进行通道注意力计算
        avg_out = self.f2(self.relu(self.f1(self.avg_pool(x))))
        max_out = self.f2(self.relu(self.f1(self.max_pool(x))))

        # 将平均池化和最大池化的结果相加,并通过 Sigmoid 缩放得到最终的通道注意力权重
        out = self.sigmoid(avg_out + max_out)

        return out


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), 'kernel size must be 3 or 7'
        padding = 3 if kernel_size == 7 else 1

        # 空间注意力模块,用于增强特征图的空间关系
        # kernel_size: 空间注意力操作的卷积核大小,padding 根据 kernel_size 自动确定
        # 计算平均值和最大值,并进行通道融合
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)

        # Sigmoid 激活函数,将空间注意力权重缩放到 [0, 1] 范围内
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # 计算特征图的平均值和最大值
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)

        # 将平均值和最大值在通道维度上拼接,用于进行空间注意力操作
        x = torch.cat([avg_out, max_out], dim=1)

        # 通过卷积操作并通过 Sigmoid 缩放得到最终的空间注意力权重
        x = self.conv(x)

        return self.sigmoid(x)


class CBAM(nn.Module):
    def __init__(self, c1, c2, ratio=16, kernel_size=7):
        super(CBAM, self).__init__()
        # 组合了通道注意力和空间注意力的CBAM模块
        # c1: 输入特征图的通道数,c2: 输出特征图的通道数,ratio: 通道注意力中的压缩比例,kernel_size: 空间注意力中的卷积核大小

        # 创建通道注意力模块
        self.channel_attention = ChannelAttention(c1, ratio)

        # 创建空间注意力模块
        self.spatial_attention = SpatialAttention(kernel_size)

    def forward(self, x):
        # 首先应用通道注意力,然后应用空间注意力,得到最终的 CBAM 特征图
        out = self.channel_attention(x) * x  # 通过通道注意力权重缩放通道
        out = self.spatial_attention(out) * out  # 通过空间注意力权重缩放空间

        return out


def test_cbam():
    c1, c2 = 64, 64
    ratio = 16
    kernel_size = 7
    cbam = CBAM(c1, c2, ratio, kernel_size)
    dummy_input = torch.randn(1, c1, 32, 32)
    output = cbam(dummy_input)

    print("Input shape:", dummy_input.shape)
    print("Output shape:", output.shape)

# Run the test
test_cbam()

3. ECA 注意力模块

论文名称:《ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks》

论文地址:https://arxiv.org/abs/1910.03151

代码地址:https://github.com/BangguWu/ECANet

3.1 原理

先前的方法大多致力于开发更复杂的注意力模块,以实现更好的性能,这不可避免地增加了模型的复杂性。为了克服性能和复杂性之间的矛盾,作者提出了一种有效的通道关注(ECA)模块,该模块只增加了少量的参数,却能获得明显的性能增益。

在这里插入图片描述

3.2 代码

import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, c1, c2, k_size=3):
        super(ECA, self).__init__()
        # ECA模块构造函数
        # c1: 输入特征图的通道数,c2: 输出特征图的通道数,k_size: ECA模块中的卷积核大小

        # 自适应平均池化层,用于计算全局平均值
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        
        # 1D卷积层,用于计算通道注意力权重
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)
        
        # Sigmoid激活函数,将通道注意力权重缩放到 [0, 1] 范围内
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # 在全局空间信息上计算特征描述符
        y = self.avg_pool(x)
        
        # 在特征描述符上应用1D卷积操作
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        
        # 多尺度信息融合
        y = self.sigmoid(y)

        # 将通道注意力权重应用到输入特征图上并返回结果
        return x * y.expand_as(x)
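
同样可以用一个最小的形状自检来验证模块(示意代码,输入尺寸为假设值):

x = torch.randn(1, 64, 32, 32)
eca = ECA(64, 64)    # 与 SE 相同,c2 只是为配合 yolo.py 读取方式而保留的占位参数
print(eca(x).shape)  # torch.Size([1, 64, 32, 32])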

4. CA 注意力模块

论文名称:《Coordinate Attention for Efficient Mobile Network Design》

论文地址:https://arxiv.org/abs/2103.02907

4.1 原理

先前的轻量级网络的注意力机制大都采用SE模块,仅考虑了通道间的信息,忽略了位置信息。尽管后来的 BAM 和 CBAM 尝试在降低通道数后通过卷积来提取位置注意力信息,但卷积只能提取局部关系,缺乏长距离关系建模的能力。为此,论文提出了新的高效注意力机制 coordinate attention(CA),能够将横向和纵向的位置信息编码到 channel attention 中,使得移动网络能够关注大范围的位置信息,又不会带来过多的计算量。

coordinate attention的优势主要有以下几点:

  • 不仅获取了通道间信息,还考虑了方向相关的位置信息,有助于模型更好地定位和识别目标;
  • 足够灵活和轻量,能够简单地插入移动网络的核心结构中;
  • 可以作为预训练模型用于多种任务中,如检测和分割,均有不错的性能提升。

在这里插入图片描述

4.2 代码

import torch
import torch.nn as nn


class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        return self.relu(x + 3) / 6


class h_swish(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)

    def forward(self, x):
        return x * self.sigmoid(x)


class CoordAtt(nn.Module):
    def __init__(self, inp, oup, reduction=32):
        super(CoordAtt, self).__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # 自适应平均池化,垂直方向
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # 自适应平均池化,水平方向
        mip = max(8, inp // reduction)  # 计算中间通道数
        self.conv1 = nn.Conv2d(inp, mip, kernel_size=1, stride=1, padding=0)  # 1x1卷积层
        self.bn1 = nn.BatchNorm2d(mip)  # 批归一化
        self.act = h_swish()  # 使用自定义的h_swish激活函数
        self.conv_h = nn.Conv2d(mip, oup, kernel_size=1, stride=1, padding=0)  # 1x1卷积层,垂直方向
        self.conv_w = nn.Conv2d(mip, oup, kernel_size=1, stride=1, padding=0)  # 1x1卷积层,水平方向

    def forward(self, x):
        identity = x  # 保存输入的原始数据
        n, c, h, w = x.size()
        # 沿宽度方向做平均池化,得到 (n, c, h, 1)
        x_h = self.pool_h(x)
        # 沿高度方向做平均池化并转置,得到 (n, c, w, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)
        y = torch.cat([x_h, x_w], dim=2)  # 在维度 2 上拼接,得到 (n, c, h+w, 1)
        y = self.conv1(y)  # 通过1x1卷积层
        y = self.bn1(y)  # 批归一化
        y = self.act(y)  # 使用h_swish激活函数
        x_h, x_w = torch.split(y, [h, w], dim=2)  # 将结果分割成垂直和水平部分
        x_w = x_w.permute(0, 1, 3, 2)  # 调整水平部分的维度为Cx1xW
        a_h = self.conv_h(x_h).sigmoid()  # 通过1x1卷积并使用sigmoid激活函数,得到垂直方向的注意力权重
        a_w = self.conv_w(x_w).sigmoid()  # 通过1x1卷积并使用sigmoid激活函数,得到水平方向的注意力权重
        out = identity * a_w * a_h  # 使用注意力权重来加权原始输入
        return out


# 创建一个虚拟的输入张量,假设通道数为64,高度为32,宽度为32
input_tensor = torch.randn(1, 64, 32, 32)

# 创建CoordAtt模块,假设输入通道数为64,输出通道数为64,reduction为32
coord_att_module = CoordAtt(64, 64, reduction=32)

# 将输入张量传递给CoordAtt模块进行前向传播
output_tensor = coord_att_module(input_tensor)

# 打印输出张量的形状,以验证模块的功能
print("Input shape:", input_tensor.shape)
print("Output shape:", output_tensor.shape)

5. 添加方式

大致的修改方式如下

在YOLOv5中添加注意力机制可分为如下 5 步,以在 yolov5s 中添加 SE 注意力机制为例子:

  1. yolov5/models文件夹下新建一个 yolov5s_SE.yaml
  2. 将本文上面提供的 SE 注意力代码添加到 common.py 文件末尾;
  3. SE 这个类的名字加入到 yolov5/models/yolo.py 中;
  4. 修改 yolov5s_SE.yaml ,将 SE 注意力加到你想添加的位置;
  5. 修改 train.py 文件的 '--cfg' 默认参数,随后就可以开始训练了。

详细的修改方式如下

  • 第 1 步:在yolov5/models文件夹下新建一个 yolov5s_SE.yaml ,将 yolov5s.yaml 文件内容拷贝粘贴到我们新建的 yolov5s_SE.yaml 文件中等待第 4 步使用;
  • 第 2 步:将本文上面提供的 SE 注意力代码添加到 yolov5/models/common.py 文件末尾;
class SE(nn.Module):
    def __init__(self, c1, c2, ratio=16):
        super(SE, self).__init__()
        #c*1*1
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.l1 = nn.Linear(c1, c1 // ratio, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.l2 = nn.Linear(c1 // ratio, c1, bias=False)
        self.sig = nn.Sigmoid()
    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avgpool(x).view(b, c)
        y = self.l1(y)
        y = self.relu(y)
        y = self.l2(y)
        y = self.sig(y)
        y = y.view(b, c, 1, 1)
        return x * y.expand_as(x)
  • 第 3 步:将 SE 这个类的名字加入到 yolov5/models/yolo.py 如下位置;

在这里插入图片描述

你的可能和我有点区别,不用在意;下面给出该位置代码的示意,方便对照。
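
以 v6.1 的 yolo.py 为例,parse_model 中该位置的代码大致形如下面这样(示意,列表里已有的模块名以你本地代码为准),只需把 SE 追加到列表末尾:

        if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, SE]:  # 在列表末尾加入 SE
            c1, c2 = ch[f], args[0]
            if c2 != no:  # if not output
                c2 = make_divisible(c2 * gw, 8)

            args = [c1, c2, *args[1:]]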

  • 第 4 步:修改 yolov5s_SE.yaml ,将 SE 注意力加到你想添加的位置;常见的位置有 C3 模块后面、Neck 中,也可以在主干的 SPPF 前添加一层;我这里演示添加到 SPPF 的上一层:
    [-1, 1, SE, [1024]], 即下图中所示位置:

在这里插入图片描述

加到这里还没完,还有两个细节需要注意!🌟

当在网络中添加了新的层之后,该层之后所有层的编号都会发生改变。看下图,原本 Detect 指定的是 [17,20,23] 层,所以在我们添加了 SE 注意力层之后,也要对 Detect 的参数进行修改:原来的 17 层变成了 18 层,原来的 20 层变成了 21 层,原来的 23 层变成了 24 层;所以 Detect 的 from 系数要改为 [18,21,24]。

在这里插入图片描述

左侧是原始的 yolov5s.yaml ,右侧为修改后的 yolov5s_SE.yaml

同样的,Concat 的 from 系数也要修改,这样才能保持原网络结构不发生大的改变。我们刚才把 SE 层加到了第 9 层,所以第 9 层之后的编号都会加 1,这里要把后面两个 Concat 的 from 系数分别由 [-1,14]、[-1,10] 改为 [-1,15]、[-1,11]。

在这里插入图片描述

左侧是原始的 yolov5s.yaml ,右侧为修改后的 yolov5s_SE.yaml

如果这一步的原理大家没看懂的话,可以看看哔哩哔哩视频,讲解了yaml文件的原理:点击跳转

  • 第 5 步:修改 train.py 文件的 '--cfg' 默认参数,在'--cfg' 后的 default= 后面加上 yolov5s_SE.yaml 的路径,随后就可以开始训练了。
    在这里插入图片描述
    在训练时会打印模型的结构,当出现下面的结构时,就代表我们添加成功了:
    在这里插入图片描述

最后放上我加入 SE 注意力层后完整的配置文件 yolov5s_SE.yaml

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone+SE
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SE, [1024]], #SE
   [-1, 1, SPPF, [1024, 5]],  # 10
  ]

# YOLOv5+SE v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 14

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 18 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 15], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 21 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 11], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 24 (P5/32-large)

   [[18, 21, 24], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
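
配置文件写好后,可以用下面这个小脚本快速自检(示意,假设在 yolov5 仓库根目录下运行,且前面几步都已完成):构建模型时会逐层打印结构,如果 from 索引写错会直接报错。

import torch
from models.yolo import Model   # YOLOv5 仓库自带的模型构建入口

model = Model('models/yolov5s_SE.yaml')   # 按新配置构建网络并打印各层结构
_ = model(torch.zeros(1, 3, 640, 640))    # 前向一次,进一步确认各层索引没有写错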

6. 添加方式的补充说明

可能到这里大家有一个误区,错误地认为只要是注意力模块,就把模块名加到 yolo.py 的如下位置就行了,其实并不是这样!

在这里插入图片描述

加到哪里?怎么加?这取决于你模块的写法。

下面我介绍另一种添加方式,这种添加方式并不需要在上图位置添加模块名,而是需要另外写一条 elif 语句。注意在 Python 中缩进很重要,很多同学会忽略这一点;这里我推荐把它加到 parse_model 里 elif m is nn.BatchNorm2d: 这一分支的下面,并与其保持相同的缩进。

比如我想添加 SimAM 模块,我可以这样添加:

在这里插入图片描述
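
上图位置的代码大致如下(示意):在 parse_model 中找到 elif m is nn.BatchNorm2d: 分支,在其后按相同缩进新增一条分支即可。

        elif m is nn.BatchNorm2d:
            args = [ch[f]]
        # 新增:SimAM 不改变通道数,yaml 中的 [1e-4] 会原样作为 e_lambda 传入
        elif m is SimAM:
            c2 = ch[f]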

另外,yaml 文件中模块的 args 参数的写法也取决于你在 yolo.py 中这部分是怎么写的,二者需要配套。

同样是 SimAM 模块,这次的 yaml 可以这样写:不管你把它添加到什么位置,args 的参数始终是 1e-4。

在这里插入图片描述


7. SimAM 注意力模块

论文名称:《SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks》

论文地址:http://proceedings.mlr.press/v139/yang21o/yang21o.pdf

代码地址:https://github.com/ZjjConan/SimAM

7.1 原理

在这篇论文中,我们提出了一个概念上简单但非常有效的注意力模块,用于卷积神经网络。与现有的基于通道和空间的注意力模块不同,我们的模块通过推断特征图中的三维注意力权重来工作,而无需向原始网络添加参数。具体而言,我们基于一些著名的神经科学理论,并提出了优化能量函数以找到每个神经元重要性的方法。我们进一步推导了能量函数的快速闭合形式解,并展示了该解可以用不到十行代码实现。该模块的另一个优点是,大多数运算符是基于定义的能量函数解的选择,避免了过多的结构调整的工作。对各种视觉任务的定量评估表明,所提出的模块具有灵活性和有效性,可以提高许多ConvNets的表征能力。

在这里插入图片描述


7.2 代码

import torch
import torch.nn as nn

class SimAM(torch.nn.Module):
    def __init__(self, e_lambda=1e-4):
        super(SimAM, self).__init__()
        self.activaton = nn.Sigmoid()  # Sigmoid激活函数
        self.e_lambda = e_lambda  # 正则化项参数

    def forward(self, x):
        b, c, h, w = x.size()
        n = w * h - 1  # 计算像素数减一的值,用于正则化
        x_minus_mu_square = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)  # 计算(x - 平均值)^2
        y = x_minus_mu_square / (4 * (x_minus_mu_square.sum(dim=[2, 3], keepdim=True) / n + self.e_lambda)) + 0.5  # 计算SimAM的输出

        return x * self.activaton(y)  # 返回SimAM的输出,通过Sigmoid激活函数缩放

# 创建一个虚拟的输入张量,假设批大小为2,通道数为3,高度和宽度分别为4和4
input_tensor = torch.randn(2, 3, 4, 4)

# 创建SimAM模块
sim_am_module = SimAM()

# 将输入张量传递给SimAM模块进行前向传播
output_tensor = sim_am_module(input_tensor)

# 打印输出张量的形状,以验证模块的功能
print("Input shape:", input_tensor.shape)
print("Output shape:", output_tensor.shape)

            
[-1, 1, SimAM, [1e-4]], # args 不需要改变

8. S2-MLPv2 注意力模块

论文名称:《S^2-MLPV2: IMPROVED SPATIAL-SHIFT MLP ARCHITECTURE FOR VISION》

论文地址:https://arxiv.org/pdf/2108.01072.pdf

8.1 原理

最近,基于多层感知机(MLP)的视觉骨干网络开始出现。相比于 CNN 和视觉 Transformer,具有较少归纳偏差的基于 MLP 的视觉架构在图像识别中取得了有竞争力的性能。其中,采用直接空间位移操作的空间位移 MLP(S2-MLP)在性能上优于包括 MLP-Mixer 和 ResMLP 在内的开创性工作。最近,使用具有金字塔结构的较小补丁,Vision Permutator(ViP)和 Global Filter Network(GFNet)在性能上优于 S2-MLP。在本文中,我们改进了 S2-MLP 视觉骨干网络:沿通道维度扩展特征图,并将扩展后的特征图分成几个部分,对这些部分进行不同的空间位移操作,同时利用分割注意力(split attention)操作来融合它们。此外,与之前的方法类似,我们采用较小规模的补丁,并使用金字塔结构来提高图像识别准确性。我们将改进后的空间位移 MLP 视觉骨干网络称为 S2-MLPv2。在 5500 万参数的规模下,我们的中等规模模型 S2-MLPv2-Medium 在 ImageNet-1K 基准上使用 224×224 的图像、在没有自注意力和外部训练数据的情况下,实现了 83.6% 的 top-1 准确率。

在这里插入图片描述
图:S2-MLP和提出的S2-MLPv2之间的空间位移操作的比较。在S2-MLP中,通道被平均分成四个部分,每个部分沿不同的方向进行位移。在位移后的通道上进行MLP操作。相比之下,在S2-MLPv2中, c 通道特征图被扩展为 3c 通道特征图。然后,扩展的特征图沿通道维度被平均分成三个部分。对于每个部分,我们进行不同的空间位移操作。然后,通过分割注意力操作,将位移后的部分合并以生成 c 通道特征图。


8.2 代码

import torch
from torch import nn

# 空间位移操作1:四组通道分别沿两个空间维度的正、反方向各平移一格
def spatial_shift1(x):
    b, w, h, c = x.size()
    # 第 1 组通道:沿第一个空间维度正向平移一格
    x[:, 1:, :, :c // 4] = x[:, :w - 1, :, :c // 4]
    # 第 2 组通道:沿第一个空间维度反向平移一格
    x[:, :w - 1, :, c // 4:c // 2] = x[:, 1:, :, c // 4:c // 2]
    # 第 3 组通道:沿第二个空间维度正向平移一格
    x[:, :, 1:, c // 2:c * 3 // 4] = x[:, :, :h - 1, c // 2:c * 3 // 4]
    # 第 4 组通道:沿第二个空间维度反向平移一格
    x[:, :, :h - 1, 3 * c // 4:] = x[:, :, 1:, 3 * c // 4:]
    return x

# 空间位移操作2:与操作1的平移方向互补
def spatial_shift2(x):
    b, w, h, c = x.size()
    # 第 1 组通道:沿第二个空间维度正向平移一格
    x[:, :, 1:, :c // 4] = x[:, :, :h - 1, :c // 4]
    # 第 2 组通道:沿第二个空间维度反向平移一格
    x[:, :, :h - 1, c // 4:c // 2] = x[:, :, 1:, c // 4:c // 2]
    # 第 3 组通道:沿第一个空间维度正向平移一格
    x[:, 1:, :, c // 2:c * 3 // 4] = x[:, :w - 1, :, c // 2:c * 3 // 4]
    # 第 4 组通道:沿第一个空间维度反向平移一格
    x[:, :w - 1, :, 3 * c // 4:] = x[:, 1:, :, 3 * c // 4:]
    return x

# Split Attention 模块
class SplitAttention(nn.Module):
    def __init__(self, channel=512, k=3):
        super().__init__()
        self.channel = channel
        self.k = k
        # MLP层1,用于生成注意力权重
        self.mlp1 = nn.Linear(channel, channel, bias=False)
        self.gelu = nn.GELU()
        # MLP层2,用于生成注意力权重
        self.mlp2 = nn.Linear(channel, channel * k, bias=False)
        # Softmax层,用于计算注意力分布
        self.softmax = nn.Softmax(1)

    def forward(self, x_all):
        b, k, h, w, c = x_all.shape
        x_all = x_all.reshape(b, k, -1, c)
        # 计算所有通道的均值并生成注意力权重
        a = torch.sum(torch.sum(x_all, 1), 1)
        hat_a = self.mlp2(self.gelu(self.mlp1(a)))
        hat_a = hat_a.reshape(b, self.k, c)
        bar_a = self.softmax(hat_a)
        attention = bar_a.unsqueeze(-2)
        # 应用注意力权重并求和以生成输出
        out = attention * x_all
        out = torch.sum(out, 1).reshape(b, h, w, c)
        return out

# S2Attention 模块
class S2Attention(nn.Module):

    def __init__(self, channels=512):
        super().__init__()
        # MLP 层 1:将通道数扩展为 3 倍,对应三条空间位移分支
        self.mlp1 = nn.Linear(channels, channels * 3)
        # MLP 层 2:输出投影,将通道数恢复为原始大小
        self.mlp2 = nn.Linear(channels, channels)
        # Split Attention 模块,用于融合三条分支的结果
        self.split_attention = SplitAttention()

    def forward(self, x):
        b, c, w, h = x.size()
        x = x.permute(0, 2, 3, 1)
        # 通过 MLP 层 1 将通道数扩展为 3c,对应三条位移分支
        x = self.mlp1(x)
        # 空间位移操作1
        x1 = spatial_shift1(x[:, :, :, :c])
        # 空间位移操作2
        x2 = spatial_shift2(x[:, :, :, c:c * 2])
        x3 = x[:, :, :, c * 2:]
        x_all = torch.stack([x1, x2, x3], 1)
        # 使用Split Attention模块生成通道注意力权重
        a = self.split_attention(x_all)
        # 通过 MLP 层 2 将通道数投影回 c
        x = self.mlp2(a)
        x = x.permute(0, 3, 1, 2)
        return x

# 测试代码
# 创建一个虚拟的输入张量,假设批大小为2,通道数为512,高度和宽度分别为32和32
input_tensor = torch.randn(2, 512, 32, 32)

# 创建S2Attention模块
s2_attention_module = S2Attention()

# 将输入张量传递给S2Attention模块进行前向传播
output_tensor = s2_attention_module(input_tensor)

# 打印输出张量的形状,以验证模块的功能
print("Input shape:", input_tensor.shape)
print("Output shape:", output_tensor.shape)

[-1, 1, S2Attention, [1024]],  # 1024 代表通道数

9. NAMAttention 注意力模块

论文名称:《NAM: Normalization-based Attention Module》

论文地址:https://arxiv.org/pdf/2111.12419.pdf

代码地址:https://github.com/Christian-lyc/NAM

9.1 原理

识别较不显著的特征对于模型压缩至关重要,然而这一点在注意力机制中尚未得到充分研究。在本研究中,我们提出了一种新颖的基于归一化的注意力模块(NAM),它抑制较不显著的权重:通过对注意力模块施加权重稀疏惩罚,使其在保持相近性能的同时更加计算高效。在 ResNet 和 MobileNet 上与其他三种注意力机制的比较表明,我们的方法具有更高的准确性。

在这里插入图片描述


9.2 代码

import torch.nn as nn
import torch
from torch.nn import functional as F

# 通道注意力模块
class Channel_Att(nn.Module):
    def __init__(self, channels):
        super(Channel_Att, self).__init__()
        self.channels = channels

        # 批归一化层,用于计算通道权重
        self.bn2 = nn.BatchNorm2d(self.channels, affine=True)

    def forward(self, x):
        residual = x  # 保留输入的副本作为残差连接的一部分

        # 对输入进行批归一化
        x = self.bn2(x)

        # 计算批归一化层的权重,用于通道加权
        weight_bn = self.bn2.weight.data.abs() / torch.sum(self.bn2.weight.data.abs())

        # 调整输入的维度顺序以进行通道加权
        x = x.permute(0, 2, 3, 1).contiguous()
        x = torch.mul(weight_bn, x)
        x = x.permute(0, 3, 1, 2).contiguous()

        # 使用Sigmoid激活函数对加权后的输入进行缩放
        x = torch.sigmoid(x) * residual  # 加权后乘以输入作为输出

        return x

# NAM 注意力模块:对通道注意力 Channel_Att 的封装
class NAMAttention(nn.Module):
    def __init__(self, channels):
        super(NAMAttention, self).__init__()
        self.Channel_Att = Channel_Att(channels)

    def forward(self, x):
        # 调用通道注意力模块进行前向传播
        x_out1 = self.Channel_Att(x)

        return x_out1

# 测试代码
# 创建一个虚拟的输入张量,假设批大小为2,通道数为64,高度和宽度分别为32和32
input_tensor = torch.randn(2, 64, 32, 32)

# 创建NAMAttention模块
nam_attention_module = NAMAttention(64)

# 将输入张量传递给NAMAttention模块进行前向传播
output_tensor = nam_attention_module(input_tensor)

# 打印输出张量的形状,以验证模块的功能
print("Input shape:", input_tensor.shape)
print("Output shape:", output_tensor.shape)

[-1, 1, NAMAttention, [1024]], # 1024 代表通道数

10. Criss-CrossAttention 注意力模块

论文名称:《CCNet: Criss-Cross Attention for Semantic Segmentation》

论文地址:https://arxiv.org/pdf/1811.11721.pdf

代码地址:https://github.com/speedinghzl/CCNet

10.1 原理

在这里插入图片描述

在语义分割和物体检测等视觉理解问题中,上下文信息至关重要。我们提出了一种名为 Criss-Cross Network(CCNet)的方法,以非常有效且高效的方式获取完整的图像上下文信息。具体来说,对于每个像素,一个新颖的十字交叉注意力模块收集其十字路径上所有像素的上下文信息;通过进一步的循环操作,每个像素最终可以捕获完整的图像依赖关系。此外,我们提出了一种类别一致损失,用于促使十字交叉注意力模块产生更具判别力的特征。总体而言,CCNet 具有以下优点:1)对 GPU 内存友好,与非局部块(non-local block)相比,所提出的循环十字交叉注意力模块使用的 GPU 内存减少了 11 倍;2)计算效率高,循环十字交叉注意力将非局部块的 FLOPs 减少了约 85%;3)性能领先,我们在 Cityscapes、ADE20K、LIP 以及 COCO 等语义分割基准数据集上进行了大量实验,特别是在 Cityscapes 测试集、ADE20K 验证集和 LIP 验证集上,CCNet 分别取得了 81.9%、45.76% 和 55.47% 的 mIoU,均为当时最先进的结果。

10.2 代码

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Softmax

# 生成无穷大矩阵的函数,用于设置注意力中的负无穷
def INF(B, H, W):
    return -torch.diag(torch.tensor(float("inf")).repeat(H), 0).unsqueeze(0).repeat(B * W, 1, 1)

class CrissCrossAttention(nn.Module):
    """ Criss-Cross Attention Module"""

    def __init__(self, in_dim):
        super(CrissCrossAttention, self).__init__()
        
        # 定义查询(query)卷积、键(key)卷积和值(value)卷积
        self.query_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=1)
        self.softmax = Softmax(dim=3)
        self.INF = INF
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        m_batchsize, _, height, width = x.size()

        # 生成查询卷积,然后调整维度以计算能量
        proj_query = self.query_conv(x)
        proj_query_H = proj_query.permute(0, 3, 1, 2).contiguous().view(m_batchsize * width, -1, height).permute(0, 2, 1)
        proj_query_W = proj_query.permute(0, 2, 1, 3).contiguous().view(m_batchsize * height, -1, width).permute(0, 2, 1)

        # 生成键卷积,然后调整维度以计算能量
        proj_key = self.key_conv(x)
        proj_key_H = proj_key.permute(0, 3, 1, 2).contiguous().view(m_batchsize * width, -1, height)
        proj_key_W = proj_key.permute(0, 2, 1, 3).contiguous().view(m_batchsize * height, -1, width)

        # 生成值卷积,然后调整维度以计算注意力
        proj_value = self.value_conv(x)
        proj_value_H = proj_value.permute(0, 3, 1, 2).contiguous().view(m_batchsize * width, -1, height)
        proj_value_W = proj_value.permute(0, 2, 1, 3).contiguous().view(m_batchsize * height, -1, width)

        # 计算垂直(高度 H)方向和水平(宽度 W)方向的能量
        energy_H = (torch.bmm(proj_query_H, proj_key_H) + self.INF(m_batchsize, height, width)).view(m_batchsize, width, height, height).permute(0, 2, 1, 3)
        energy_W = torch.bmm(proj_query_W, proj_key_W).view(m_batchsize, height, width, width)

        # 连接水平和垂直能量,然后应用softmax以获得注意力权重
        concate = self.softmax(torch.cat([energy_H, energy_W], 3))

        # 分离水平和垂直方向的注意力权重
        att_H = concate[:, :, :, 0:height].permute(0, 2, 1, 3).contiguous().view(m_batchsize * width, height, height)
        att_W = concate[:, :, :, height:height + width].contiguous().view(m_batchsize * height, width, width)

        # 应用注意力权重到值(value)并调整输出维度
        out_H = torch.bmm(proj_value_H, att_H.permute(0, 2, 1)).view(m_batchsize, width, -1, height).permute(0, 2, 3, 1)
        out_W = torch.bmm(proj_value_W, att_W.permute(0, 2, 1)).view(m_batchsize, height, -1, width).permute(0, 2, 1, 3)

        # 返回经过注意力操作后的结果,同时加上原始输入作为残差连接
        return self.gamma * (out_H + out_W) + x

# 测试代码
# 创建一个虚拟的输入张量,假设批大小为2,通道数为64,高度和宽度分别为32和32
input_tensor = torch.randn(2, 64, 32, 32)

# 创建CrissCrossAttention模块
cca_module = CrissCrossAttention(64)

# 将输入张量传递给CrissCrossAttention模块进行前向传播
output_tensor = cca_module(input_tensor)

# 打印输出张量的形状,以验证模块的功能
print("Input shape:", input_tensor.shape)
print("Output shape:", output_tensor.shape)

[-1, 1, CrissCrossAttention, [1024]], # 1024 代表通道数

12. Selective Kernel Attention 注意力模块

论文名称:《Selective Kernel Networks》

论文地址:https://arxiv.org/pdf/1903.06586.pdf

代码地址:https://github.com/implus/SKNet

12.1 原理

在标准的卷积神经网络中,每层人工神经元的感受野被设计为具有相同的大小。神经科学界已经广泛认识到,视觉皮层神经元的感受野大小会受到刺激的调节,然而在构建CNN时很少考虑这一点。我们提出了一种动态选择机制,使得CNN中的每个神经元可以根据多个输入信息的尺度自适应地调整其感受野大小。我们设计了一个称为Selective Kernel (SK)单元的构建块,其中使用softmax注意力将具有不同核大小的多个分支进行融合,这种注意力受到这些分支中的信息的指导。对这些分支的不同注意力产生了融合层中神经元的有效感受野的不同大小。多个SK单元堆叠成一个深度网络,称为Selective Kernel Networks (SKNets)。在 ImageNetCIFAR 基准测试中,我们经验证明SKNet 在模型复杂度较低的情况下胜过了现有的最先进架构。详细分析表明,SKNet中的神经元能够捕捉具有不同尺度的目标对象,从而验证了神经元根据输入自适应调整感受野大小的能力。

在这里插入图片描述


12.2 代码

import torch
import torch.nn as nn
from collections import OrderedDict

class SKAttention(nn.Module):

    def __init__(self, channel=512, kernels=[1, 3, 5, 7], reduction=16, group=1, L=32):
        super().__init__()
        self.d = max(L, channel // reduction)
        self.convs = nn.ModuleList([])
        # 创建不同尺寸卷积核的卷积层,每个卷积层包含卷积、批归一化和ReLU激活
        for k in kernels:
            self.convs.append(
                nn.Sequential(OrderedDict([
                    ('conv', nn.Conv2d(channel, channel, kernel_size=k, padding=k // 2, groups=group)),
                    ('bn', nn.BatchNorm2d(channel)),
                    ('relu', nn.ReLU())
                ]))
            )
        self.fc = nn.Linear(channel, self.d)
        self.fcs = nn.ModuleList([])
        # 创建全连接层
        for i in range(len(kernels)):
            self.fcs.append(nn.Linear(self.d, channel))
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        bs, c, _, _ = x.size()
        conv_outs = []
        ### 分割:通过不同尺寸卷积核进行卷积操作
        for conv in self.convs:
            conv_outs.append(conv(x))
        feats = torch.stack(conv_outs, 0)  # k,bs,channel,h,w

        ### 融合:将不同卷积核的输出叠加在一起
        U = sum(conv_outs)  # bs,c,h,w

        ### 降维通道:通过均值池化降低通道数
        S = U.mean(-1).mean(-1)  # bs,c
        Z = self.fc(S)  # bs,d

        ### 计算注意力权重
        weights = []
        for fc in self.fcs:
            weight = fc(Z)
            weights.append(weight.view(bs, c, 1, 1))  # bs,channel
        attention_weights = torch.stack(weights, 0)  # k,bs,channel,1,1
        attention_weights = self.softmax(attention_weights)  # k,bs,channel,1,1

        ### 融合:将注意力权重与特征图相乘并求和
        V = (attention_weights * feats).sum(0)
        return V

# 测试代码
if __name__ == "__main__":
    # 创建SKAttention的实例
    sk_attention = SKAttention(channel=512, kernels=[1, 3, 5, 7], reduction=16, group=1, L=32)
    
    # 创建一个随机输入张量
    input_tensor = torch.randn(32, 512, 64, 64)  # 批次大小:32,通道数:512,高度:64,宽度:64
    
    # 前向传播
    output = sk_attention(input_tensor)
    
    # 打印输出张量的形状以进行验证
    print("输出形状:", output.shape)

[-1, 1, SKAttention, [[1, 3, 5, 7], 16, 1, 32]], # args 都不需要改变

11. GAMAttention 注意力模块

论文名称:《Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions》

论文地址:https://arxiv.org/pdf/2112.05561v1.pdf

11.1 原理

各种注意机制已经被研究用于改善各种计算机视觉任务的性能。然而,先前的方法忽视了在通道和空间方面保留信息以增强跨维度交互的重要性。因此,我们提出了一种全局注意机制,通过减少信息损失和放大全局交互表示来提升深度神经网络的性能。我们引入了3D排列和多层感知机用于通道注意,同时还引入了卷积空间注意的子模块。对于在CIFAR-100ImageNet-1K上进行的图像分类任务的评估表明,我们的方法在ResNet和轻量级MobileNet上稳定地优于几种最近的注意机制。

在这里插入图片描述
在这里插入图片描述
通道注意子模块使用3D排列来跨三个维度保留信息。然后,它使用两层MLP(多层感知机)放大跨维度的通道-空间依赖关系。(MLP是一个编码器-解码器结构,具有与BAM相同的减少比例r)。通道注意子模块如图2所示。

在空间注意子模块中,为了关注空间信息,我们使用两个卷积层进行空间信息融合。我们还使用与通道注意子模块相同的减少比例r,与BAM相同。同时,最大池化会减少信息并对结果产生负面影响。为了进一步保留特征图,我们去除了池化操作。结果是,空间注意模块有时会显著增加参数的数量。为了防止参数数量显著增加,我们在ResNet50中采用了具有通道随机洗牌的组卷积。不带组卷积的空间注意子模块如图3所示。


11.2 代码

import numpy as np
import torch
from torch import nn
from torch.nn import init

class GAMAttention(nn.Module):

    def __init__(self, c1, c2, group=True, rate=4):
        super(GAMAttention, self).__init__()

        self.channel_attention = nn.Sequential(
            nn.Linear(c1, int(c1 / rate)),
            nn.ReLU(inplace=True),
            nn.Linear(int(c1 / rate), c1)
        )
        self.spatial_attention = nn.Sequential(
            nn.Conv2d(c1, c1 // rate, kernel_size=7, padding=3, groups=rate) if group else nn.Conv2d(c1, int(c1 / rate),
                                                                                                     kernel_size=7,
                                                                                                     padding=3),
            nn.BatchNorm2d(int(c1 / rate)),
            nn.ReLU(inplace=True),
            nn.Conv2d(c1 // rate, c2, kernel_size=7, padding=3, groups=rate) if group else nn.Conv2d(int(c1 / rate), c2,
                                                                                                     kernel_size=7,
                                                                                                     padding=3),
            nn.BatchNorm2d(c2)
        )

    def forward(self, x):
        b, c, h, w = x.shape
        x_permute = x.permute(0, 2, 3, 1).view(b, -1, c)
        x_att_permute = self.channel_attention(x_permute).view(b, h, w, c)
        x_channel_att = x_att_permute.permute(0, 3, 1, 2)
        x = x * x_channel_att

        x_spatial_att = self.spatial_attention(x).sigmoid()
        x_spatial_att = channel_shuffle(x_spatial_att, 4)  # last shuffle
        out = x * x_spatial_att
        return out


def channel_shuffle(x, groups=2):
    # forward 中用到的通道重排操作(channel shuffle)的一种常见实现:分组后交换通道顺序,促进组间信息流动
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.permute(0, 2, 1, 3, 4).contiguous()
    return x.view(b, c, h, w)


# yolo.py 的 parse_model 中对应的读取分支(与第 6 节介绍的 elif 添加方式配套):
        elif m in [GAMAttention]:  # in_channels, out_channels
            c1, c2 = ch[f], args[0]
            if c2 != no:
                c2 = make_divisible(c2 * gw, 8)
            args = [c1, c2, *args[1:]]
[-1, 1, GAMAttention, [1024, True, 4]], # 1024 代表通道数 这里的True和4保持不变

14. A2-Net 注意力模块

论文名称:《A2-Nets: Double Attention Networks》

论文地址:https://arxiv.org/pdf/1810.11579.pdf

14.1 原理

学习捕捉远距离关系对于图像/视频识别是基础性的。现有的CNN模型通常依靠增加深度来建模这些关系,这在很大程度上效率低下。在这项工作中,我们提出了“双重注意力块”,这是一种新颖的组件,它可以从输入图像/视频的整个时空空间聚合和传播有信息的全局特征,使得后续的卷积层可以高效地访问整个空间的特征。该组件设计了一个双重注意力机制的两个步骤,第一步通过二阶注意力池将整个空间的特征聚集到一个紧凑的集合中,第二步通过另一个注意力机制自适应地选择和分配特征到每个位置。所提出的双重注意力块易于采用,并可以方便地插入到现有的深度神经网络中。我们进行了大量的消融研究和实验证明其性能。在图像识别任务中,使用我们的双重注意力块装备的ResNet-50ImageNet-1k数据集上胜过了更大的ResNet-152架构,参数数量减少了40%以上,FLOPs也减少了。在动作识别任务中,我们提出的模型在KineticsUCF-101数据集上取得了最先进的结果,并具有比最近的研究工作更高的效率。
在这里插入图片描述


14.2 代码
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init

class DoubleAttention(nn.Module):

    def __init__(self, in_channels, c_m, c_n, reconstruct=True):
        super().__init__()
        self.in_channels = in_channels
        self.reconstruct = reconstruct
        self.c_m = c_m
        self.c_n = c_n
        self.convA = nn.Conv2d(in_channels, c_m, 1)
        self.convB = nn.Conv2d(in_channels, c_n, 1)
        self.convV = nn.Conv2d(in_channels, c_n, 1)
        if self.reconstruct:
            self.conv_reconstruct = nn.Conv2d(c_m, in_channels, kernel_size=1)
        self.init_weights()

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)

    def forward(self, x):
        b, c, h, w = x.shape
        assert c == self.in_channels
        A = self.convA(x)  # b,c_m,h,w
        B = self.convB(x)  # b,c_n,h,w
        V = self.convV(x)  # b,c_n,h,w
        tmpA = A.view(b, self.c_m, -1)
        attention_maps = F.softmax(B.view(b, self.c_n, -1), dim=1)
        attention_vectors = F.softmax(V.view(b, self.c_n, -1), dim=1)
        # step 1: feature gating
        global_descriptors = torch.bmm(tmpA, attention_maps.permute(0, 2, 1))  # b.c_m,c_n
        # step 2: feature distribution
        tmpZ = global_descriptors.matmul(attention_vectors)  # b,c_m,h*w
        tmpZ = tmpZ.view(b, self.c_m, h, w)  # b,c_m,h,w
        if self.reconstruct:
            tmpZ = self.conv_reconstruct(tmpZ)

        return tmpZ
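
下面补一个简单的形状自检(示意代码,参数为假设值);当 reconstruct=True 时,输出通道会恢复为输入通道数:

if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    a2 = DoubleAttention(64, 128, 128)    # in_channels=64, c_m=c_n=128
    print("Input shape:", x.shape)
    print("Output shape:", a2(x).shape)   # torch.Size([2, 64, 32, 32])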

# yolo.py 的 parse_model 中对应的读取分支(与第 6 节介绍的 elif 添加方式配套):
        elif m in [DoubleAttention]:  # in_channels 取自上一层输出,其余参数来自 yaml
            c2 = ch[f]
            args = [c2, *args[0:]]

[-1, 1, DoubleAttention, [128, 128, True]], # args 都不需要改变

在C3模块中加入注意力机制


1.第一版本添加方式介绍

请添加图片描述

1.1 C3SE

第一步:把注意力结构代码放到 common.py 文件中。以 C3SE 举例,将下面这段代码粘贴到 common.py 文件末尾:

class SEBottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5, ratio=16):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2
        # self.se=SE(c1,c2,ratio)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.l1 = nn.Linear(c1, c1 // ratio, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.l2 = nn.Linear(c1 // ratio, c1, bias=False)
        self.sig = nn.Sigmoid()

    def forward(self, x):
        x1 = self.cv2(self.cv1(x))
        b, c, _, _ = x.size()
        y = self.avgpool(x1).view(b, c)
        y = self.l1(y)
        y = self.relu(y)
        y = self.l2(y)
        y = self.sig(y)
        y = y.view(b, c, 1, 1)
        out = x1 * y.expand_as(x1)

        # out=self.se(x1)*x1
        return x + out if self.add else out


class C3SE(C3):
    # C3 module with SEBottleneck()
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)  # hidden channels
        self.m = nn.Sequential(*(SEBottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))

第二步:找到 yolo.py 文件里的 parse_model 函数,将类名加入进去,加入的位置如下图所示,图后给出了对应的示意代码。
在这里插入图片描述
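
该位置的修改大致如下(示意,基于 v6.1 的 parse_model,列表中已有的模块名以你本地代码为准)。C3SE 和 C3 一样带有堆叠次数 n,所以两处列表都要加:

        if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, C3SE]:   # 第一处:推断输入/输出通道
            c1, c2 = ch[f], args[0]
            if c2 != no:
                c2 = make_divisible(c2 * gw, 8)

            args = [c1, c2, *args[1:]]
            if m in [BottleneckCSP, C3, C3TR, C3Ghost, C3SE]:      # 第二处:插入重复次数 n
                args.insert(2, n)  # number of repeats
                n = 1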

第三步:修改配置文件(我这里拿 yolov5s.yaml 举例子),将 C3 层替换为我们新引入的 C3SE 层
yolov5s_C3SE.yaml

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3SE, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3SE, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3SE, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3SE, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

在这里插入图片描述
其它注意力机制同理

1.2 C3CA

class CABottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5, ratio=32):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2
        # self.ca=CoordAtt(c1,c2,ratio)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))
        mip = max(8, c1 // ratio)
        self.conv1 = nn.Conv2d(c1, mip, kernel_size=1, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(mip)
        self.act = h_swish()
        self.conv_h = nn.Conv2d(mip, c2, kernel_size=1, stride=1, padding=0)
        self.conv_w = nn.Conv2d(mip, c2, kernel_size=1, stride=1, padding=0)
        
    def forward(self, x):
        x1=self.cv2(self.cv1(x))
        n, c, h, w = x.size()
        # (n, c, h, 1)
        x_h = self.pool_h(x1)
        # (n, c, w, 1)
        x_w = self.pool_w(x1).permute(0, 1, 3, 2)
        # 拼接后为 (n, c, h+w, 1)
        y = torch.cat([x_h, x_w], dim=2)
        y = self.conv1(y)
        y = self.bn1(y)
        y = self.act(y)
        x_h, x_w = torch.split(y, [h, w], dim=2)
        x_w = x_w.permute(0, 1, 3, 2)
        a_h = self.conv_h(x_h).sigmoid()
        a_w = self.conv_w(x_w).sigmoid()
        out = x1 * a_w * a_h

        # out=self.ca(x1)*x1
        return x + out if self.add else out


class C3CA(C3):
    # C3 module with CABottleneck()
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)  # hidden channels
        self.m = nn.Sequential(*(CABottleneck(c_, c_,shortcut, g, e=1.0) for _ in range(n)))

1.3 C3CBAM

class CBAMBottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5,ratio=16,kernel_size=7):  # ch_in, ch_out, shortcut, groups, expansion
        super(CBAMBottleneck,self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2
        self.channel_attention = ChannelAttention(c2, ratio)
        self.spatial_attention = SpatialAttention(kernel_size)
        #self.cbam=CBAM(c1,c2,ratio,kernel_size)

    def forward(self, x):
        x1 = self.cv2(self.cv1(x))
        out = self.channel_attention(x1) * x1
        # print('outchannels:{}'.format(out.shape))
        out = self.spatial_attention(out) * out
        return x + out if self.add else out


class C3CBAM(C3):
    # C3 module with CBAMBottleneck()
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)  # hidden channels
        self.m = nn.Sequential(*(CBAMBottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))

1.4 C3ECA

class ECABottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5, ratio=16, k_size=3):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2
        # self.eca=ECA(c1,c2)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x1 = self.cv2(self.cv1(x))
        # out=self.eca(x1)*x1
        y = self.avg_pool(x1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        y = self.sigmoid(y)
        out = x1 * y.expand_as(x1)

        return x + out if self.add else out


class C3ECA(C3):
    # C3 module with ECABottleneck()
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)  # hidden channels
        self.m = nn.Sequential(*(ECABottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))
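
这四个模块的接口与 C3 完全一致,可以用下面的自检代码快速验证(示意:假设以上类以及它们依赖的 h_swish、ChannelAttention、SpatialAttention 等前文给出的代码都已经粘贴进 common.py,在该文件末尾临时运行):

if __name__ == '__main__':
    import torch
    x = torch.randn(1, 64, 32, 32)
    for block in (C3SE(64, 64), C3CA(64, 64), C3CBAM(64, 64), C3ECA(64, 64)):
        assert block(x).shape == x.shape   # 输出形状不变,说明可以直接替换原有的 C3
    print('C3 attention blocks OK')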

2.第二版本添加方式介绍

请添加图片描述

2.1 C3_SE_Attention

class C3_SE_Attention(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))
        self._SE = SE(c2, c2)

    def forward(self, x):
        return self._SE(self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1)))

2.2 C3_ECA_Attention

class C3_ECA_Attention(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))
        self._ECA = ECA(c2, c2)

    def forward(self, x):
        return self._ECA(self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1)))

2.3 C3_CBAM_Attention

class C3_CBAM_Attention(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))
        self._CBAM = CBAM(c2, c2)

    def forward(self, x):
        return self._CBAM(self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1)))

2.4 C3_CoordAtt_Attention

class C3_CoordAtt_Attention(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))
        self._CoordAtt = CoordAtt(c2, c2)

    def forward(self, x):
        return self._CoordAtt(self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1)))
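
第二版四个模块的入参形式同样与 C3 一致(c1、c2、n),在 yolo.py 中的注册方式与第一版相同,把类名加入 parse_model 的两处列表即可(示意,基于 v6.1,列表中已有的模块名以你本地代码为准):

        if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost,
                 C3_SE_Attention, C3_ECA_Attention, C3_CBAM_Attention, C3_CoordAtt_Attention]:
            c1, c2 = ch[f], args[0]
            if c2 != no:
                c2 = make_divisible(c2 * gw, 8)

            args = [c1, c2, *args[1:]]
            if m in [BottleneckCSP, C3, C3TR, C3Ghost,
                     C3_SE_Attention, C3_ECA_Attention, C3_CBAM_Attention, C3_CoordAtt_Attention]:
                args.insert(2, n)  # number of repeats
                n = 1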

第二版配置文件:

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3_CoordAtt_Attention, [128]], #可替换为C3_SE_Attention/C3_ECA_Attention/C3_CBAM_Attention
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3_CoordAtt_Attention, [256]], #可替换为C3_SE_Attention/C3_ECA_Attention/C3_CBAM_Attention
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3_CoordAtt_Attention, [512]], #可替换为C3_SE_Attention/C3_ECA_Attention/C3_CBAM_Attention
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3_CoordAtt_Attention, [1024]], #可替换为C3_SE_Attention/C3_ECA_Attention/C3_CBAM_Attention
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

Github项目:Yolov5_Magic

更多注意力机制及代码



前言

本篇博文继续追加10余种。与之前不同的是,此篇博文代码添加方式更加严谨更加鲁棒,使用更简单,针对不同种类注意力,将yolo.py中注意力模块读取方式分类编写,并且将注意力机制的添加方式“模板化”,每个注意力模块都可以在任何模板中直接插入,除少数模块外,不需要考虑通道信息,更适合小白入门。

之前为了代码添加方便在注意力模块中添加了很多无效的实参,此次完全移除,保留原本注意力模块的参数信息,并且在模型配置文件上给与大家更多可调空间。

以下注意力模块均已适配YOLOv5/v7/v8,且每个注意力模块都有对应的论文。
在这里插入图片描述


注意力代码

添加到 common.py 末尾;

# ---------------------------SE Begin---------------------------
class SE(nn.Module):
    def __init__(self, c1, ratio=16):
        super(SE, self).__init__()
        # c*1*1
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.l1 = nn.Linear(c1, c1 // ratio, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.l2 = nn.Linear(c1 // ratio, c1, bias=False)
        self.sig = nn.Sigmoid()
 
    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avgpool(x).view(b, c)
        y = self.l1(y)
        y = self.relu(y)
        y = self.l2(y)
        y = self.sig(y)
        y = y.view(b, c, 1, 1)
        return x * y.expand_as(x)
 
 
# ---------------------------SE End---------------------------
 
 
# ---------------------------CBAM Begin---------------------------
class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ratio=16):
        super(ChannelAttention, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.f1 = nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False)
        self.relu = nn.ReLU()
        self.f2 = nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False)
        self.sigmoid = nn.Sigmoid()
 
    def forward(self, x):
        avg_out = self.f2(self.relu(self.f1(self.avg_pool(x))))
        max_out = self.f2(self.relu(self.f1(self.max_pool(x))))
        out = self.sigmoid(avg_out + max_out)
        return out
 
 
class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), 'kernel size must be 3 or 7'
        padding = 3 if kernel_size == 7 else 1
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)
        self.sigmoid = nn.Sigmoid()
 
    def forward(self, x):
        # 1*h*w
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        x = torch.cat([avg_out, max_out], dim=1)
        # 2*h*w
        x = self.conv(x)
        # 1*h*w
        return self.sigmoid(x)
 
 
class CBAM(nn.Module):
    def __init__(self, c1, ratio=16, kernel_size=7):
        super(CBAM, self).__init__()
        self.channel_attention = ChannelAttention(c1, ratio)
        self.spatial_attention = SpatialAttention(kernel_size)
 
    def forward(self, x):
        out = self.channel_attention(x) * x
        # c*h*w
        # c*h*w * 1*h*w
        out = self.spatial_attention(out) * out
        return out
 
 
# ---------------------------CBAM End---------------------------
 
 
# ---------------------------ECA Begin---------------------------
class ECA(nn.Module):
 
    def __init__(self, k_size=3):
        super(ECA, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()
 
    def forward(self, x):
        # feature descriptor on the global spatial information
        y = self.avg_pool(x)
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        # Multi-scale information fusion
        y = self.sigmoid(y)
 
        return x * y.expand_as(x)
 
 
# ---------------------------ECA End---------------------------
 
 
# ---------------------------CA Begin---------------------------
class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)
 
    def forward(self, x):
        return self.relu(x + 3) / 6
 
 
class h_swish(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)
 
    def forward(self, x):
        return x * self.sigmoid(x)
 
 
class CoordAtt(nn.Module):
    def __init__(self, inp, oup, reduction=32):
        super(CoordAtt, self).__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))
        mip = max(8, inp // reduction)
        self.conv1 = nn.Conv2d(inp, mip, kernel_size=1, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(mip)
        self.act = h_swish()
        self.conv_h = nn.Conv2d(mip, oup, kernel_size=1, stride=1, padding=0)
        self.conv_w = nn.Conv2d(mip, oup, kernel_size=1, stride=1, padding=0)
 
    def forward(self, x):
        identity = x
        n, c, h, w = x.size()
        # (n, c, h, 1)
        x_h = self.pool_h(x)
        # (n, c, w, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)
        # 拼接后为 (n, c, h+w, 1)
        y = torch.cat([x_h, x_w], dim=2)
        y = self.conv1(y)
        y = self.bn1(y)
        y = self.act(y)
        x_h, x_w = torch.split(y, [h, w], dim=2)
        x_w = x_w.permute(0, 1, 3, 2)
        a_h = self.conv_h(x_h).sigmoid()
        a_w = self.conv_w(x_w).sigmoid()
        out = identity * a_w * a_h
        return out
 
 
# ---------------------------CA End---------------------------
 
 
# ---------------------------SimAM Begin---------------------------
class SimAM(torch.nn.Module):
    def __init__(self, e_lambda=1e-4):
        super(SimAM, self).__init__()
 
        self.activaton = nn.Sigmoid()
        self.e_lambda = e_lambda
 
    def __repr__(self):
        s = self.__class__.__name__ + '('
        s += ('lambda=%f)' % self.e_lambda)
        return s
 
    @staticmethod
    def get_module_name():
        return "simam"
 
    def forward(self, x):
        b, c, h, w = x.size()
 
        n = w * h - 1
 
        x_minus_mu_square = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)
        y = x_minus_mu_square / (4 * (x_minus_mu_square.sum(dim=[2, 3], keepdim=True) / n + self.e_lambda)) + 0.5
 
        return x * self.activaton(y)
 
 
# ---------------------------SimAM End---------------------------
 
 
# ---------------------------S2-MLPv2 Begin---------------------------
def spatial_shift1(x):
    b, w, h, c = x.size()
    x[:, 1:, :, :c // 4] = x[:, :w - 1, :, :c // 4]
    x[:, :w - 1, :, c // 4:c // 2] = x[:, 1:, :, c // 4:c // 2]
    x[:, :, 1:, c // 2:c * 3 // 4] = x[:, :, :h - 1, c // 2:c * 3 // 4]
    x[:, :, :h - 1, 3 * c // 4:] = x[:, :, 1:, 3 * c // 4:]
    return x
 
 
def spatial_shift2(x):
    b, w, h, c = x.size()
    x[:, :, 1:, :c // 4] = x[:, :, :h - 1, :c // 4]
    x[:, :, :h - 1, c // 4:c // 2] = x[:, :, 1:, c // 4:c // 2]
    x[:, 1:, :, c // 2:c * 3 // 4] = x[:, :w - 1, :, c // 2:c * 3 // 4]
    x[:, :w - 1, :, 3 * c // 4:] = x[:, 1:, :, 3 * c // 4:]
    return x
 
 
class SplitAttention(nn.Module):
    def __init__(self, channel=512, k=3):
        super().__init__()
        self.channel = channel
        self.k = k
        self.mlp1 = nn.Linear(channel, channel, bias=False)
        self.gelu = nn.GELU()
        self.mlp2 = nn.Linear(channel, channel * k, bias=False)
        self.softmax = nn.Softmax(1)
 
    def forward(self, x_all):
        b, k, h, w, c = x_all.shape
        x_all = x_all.reshape(b, k, -1, c)
        a = torch.sum(torch.sum(x_all, 1), 1)
        hat_a = self.mlp2(self.gelu(self.mlp1(a)))
        hat_a = hat_a.reshape(b, self.k, c)
        bar_a = self.softmax(hat_a)
        attention = bar_a.unsqueeze(-2)
        out = attention * x_all
        out = torch.sum(out, 1).reshape(b, h, w, c)
        return out
 
 
class S2Attention(nn.Module):
 
    def __init__(self, channels=512):
        super().__init__()
        self.mlp1 = nn.Linear(channels, channels * 3)
        self.mlp2 = nn.Linear(channels, channels)
        self.split_attention = SplitAttention()
 
    def forward(self, x):
        b, c, w, h = x.size()
        x = x.permute(0, 2, 3, 1)
        x = self.mlp1(x)
        x1 = spatial_shift1(x[:, :, :, :c])
        x2 = spatial_shift2(x[:, :, :, c:c * 2])
        x3 = x[:, :, :, c * 2:]
        x_all = torch.stack([x1, x2, x3], 1)
        a = self.split_attention(x_all)
        x = self.mlp2(a)
        x = x.permute(0, 3, 1, 2)
        return x
 
 
# ---------------------------S2-MLPv2 End---------------------------
 
 
# ---------------------------NAMAttention Begin---------------------------
class Channel_Att(nn.Module):
    def __init__(self, channels):
        super(Channel_Att, self).__init__()
        self.channels = channels
 
        self.bn2 = nn.BatchNorm2d(self.channels, affine=True)
 
    def forward(self, x):
        residual = x
 
        x = self.bn2(x)
        weight_bn = self.bn2.weight.data.abs() / torch.sum(self.bn2.weight.data.abs())
        x = x.permute(0, 2, 3, 1).contiguous()
        x = torch.mul(weight_bn, x)
        x = x.permute(0, 3, 1, 2).contiguous()
 
        x = torch.sigmoid(x) * residual  #
 
        return x
 
 
class NAMAttention(nn.Module):
    def __init__(self, channels):
        super(NAMAttention, self).__init__()
        self.Channel_Att = Channel_Att(channels)
 
    def forward(self, x):
        x_out1 = self.Channel_Att(x)
 
        return x_out1
 
 
# ---------------------------NAMAttention End---------------------------
 
 
# ---------------------------Criss-CrossAttention Begin---------------------------
from torch.nn import Softmax
 
 
def INF(B, H, W):
    return -torch.diag(torch.tensor(float("inf")).repeat(H), 0).unsqueeze(0).repeat(B * W, 1, 1)
 
 
class CrissCrossAttention(nn.Module):
    """ Criss-Cross Attention Module"""
 
    def __init__(self, in_dim):
        super(CrissCrossAttention, self).__init__()
        self.query_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=1)
        self.softmax = Softmax(dim=3)
        self.INF = INF
        self.gamma = nn.Parameter(torch.zeros(1))
 
    def forward(self, x):
        m_batchsize, _, height, width = x.size()
        proj_query = self.query_conv(x)
        proj_query_H = proj_query.permute(0, 3, 1, 2).contiguous().view(m_batchsize * width, -1, height).permute(0, 2,
                                                                                                                 1)
        proj_query_W = proj_query.permute(0, 2, 1, 3).contiguous().view(m_batchsize * height, -1, width).permute(0, 2,
                                                                                                                 1)
        proj_key = self.key_conv(x)
        proj_key_H = proj_key.permute(0, 3, 1, 2).contiguous().view(m_batchsize * width, -1, height)
        proj_key_W = proj_key.permute(0, 2, 1, 3).contiguous().view(m_batchsize * height, -1, width)
        proj_value = self.value_conv(x)
        proj_value_H = proj_value.permute(0, 3, 1, 2).contiguous().view(m_batchsize * width, -1, height)
        proj_value_W = proj_value.permute(0, 2, 1, 3).contiguous().view(m_batchsize * height, -1, width)
        energy_H = (torch.bmm(proj_query_H, proj_key_H) + self.INF(m_batchsize, height, width)).view(m_batchsize, width,
                                                                                                     height,
                                                                                                     height).permute(0,
                                                                                                                     2,
                                                                                                                     1,
                                                                                                                     3)
        energy_W = torch.bmm(proj_query_W, proj_key_W).view(m_batchsize, height, width, width)
        concate = self.softmax(torch.cat([energy_H, energy_W], 3))
 
        att_H = concate[:, :, :, 0:height].permute(0, 2, 1, 3).contiguous().view(m_batchsize * width, height, height)
        # print(concate)
        # print(att_H)
        att_W = concate[:, :, :, height:height + width].contiguous().view(m_batchsize * height, width, width)
        out_H = torch.bmm(proj_value_H, att_H.permute(0, 2, 1)).view(m_batchsize, width, -1, height).permute(0, 2, 3, 1)
        out_W = torch.bmm(proj_value_W, att_W.permute(0, 2, 1)).view(m_batchsize, height, -1, width).permute(0, 2, 1, 3)
        # print(out_H.size(),out_W.size())
        return self.gamma * (out_H + out_W) + x
 
 
# ---------------------------Criss-CrossAttention End---------------------------
 
 
# ---------------------------GAMAttention Begin---------------------------
class GAMAttention(nn.Module):
 
    def __init__(self, c1, c2, group=True, rate=4):
        super(GAMAttention, self).__init__()
 
        self.channel_attention = nn.Sequential(
            nn.Linear(c1, int(c1 / rate)),
            nn.ReLU(inplace=True),
            nn.Linear(int(c1 / rate), c1)
        )
        self.spatial_attention = nn.Sequential(
            nn.Conv2d(c1, c1 // rate, kernel_size=7, padding=3, groups=rate) if group else nn.Conv2d(c1, int(c1 / rate),
                                                                                                     kernel_size=7,
                                                                                                     padding=3),
            nn.BatchNorm2d(int(c1 / rate)),
            nn.ReLU(inplace=True),
            nn.Conv2d(c1 // rate, c2, kernel_size=7, padding=3, groups=rate) if group else nn.Conv2d(int(c1 / rate), c2,
                                                                                                     kernel_size=7,
                                                                                                     padding=3),
            nn.BatchNorm2d(c2)
        )
 
    def forward(self, x):
        b, c, h, w = x.shape
        x_permute = x.permute(0, 2, 3, 1).view(b, -1, c)
        x_att_permute = self.channel_attention(x_permute).view(b, h, w, c)
        x_channel_att = x_att_permute.permute(0, 3, 1, 2)
        x = x * x_channel_att
 
        x_spatial_att = self.spatial_attention(x).sigmoid()
        x_spatial_att = channel_shuffle(x_spatial_att, 4)  # last shuffle
        out = x * x_spatial_att
        return out
 
 
def channel_shuffle(x, groups=2):
    # channel shuffle: reshape -> transpose -> flatten
    B, C, H, W = x.size()
    out = x.view(B, groups, C // groups, H, W).permute(0, 2, 1, 3, 4).contiguous()
    out = out.view(B, C, H, W)
    return out
 
 
# ---------------------------GAMAttention End---------------------------
 
 
# ---------------------------Selective Kernel Attention Begin-------------------------
from collections import OrderedDict  # needed for the named Sequential below (if not already imported)


class SKAttention(nn.Module):
 
    def __init__(self, channel=512, kernels=[1, 3, 5, 7], reduction=16, group=1, L=32):
        super().__init__()
        self.d = max(L, channel // reduction)
        self.convs = nn.ModuleList([])
        for k in kernels:
            self.convs.append(
                nn.Sequential(OrderedDict([
                    ('conv', nn.Conv2d(channel, channel, kernel_size=k, padding=k // 2, groups=group)),
                    ('bn', nn.BatchNorm2d(channel)),
                    ('relu', nn.ReLU())
                ]))
            )
        self.fc = nn.Linear(channel, self.d)
        self.fcs = nn.ModuleList([])
        for i in range(len(kernels)):
            self.fcs.append(nn.Linear(self.d, channel))
        self.softmax = nn.Softmax(dim=0)
 
    def forward(self, x):
        bs, c, _, _ = x.size()
        conv_outs = []
        ### split
        for conv in self.convs:
            conv_outs.append(conv(x))
        feats = torch.stack(conv_outs, 0)  # k,bs,channel,h,w
 
        ### fuse
        U = sum(conv_outs)  # bs,c,h,w
 
        ### reduction channel
        S = U.mean(-1).mean(-1)  # bs,c
        Z = self.fc(S)  # bs,d
 
        ### calculate attention weight
        weights = []
        for fc in self.fcs:
            weight = fc(Z)
            weights.append(weight.view(bs, c, 1, 1))  # bs,channel
        attention_weights = torch.stack(weights, 0)  # k,bs,channel,1,1
        attention_weights = self.softmax(attention_weights)  # k,bs,channel,1,1

        ### fuse
        V = (attention_weights * feats).sum(0)
        return V
 
 
# ---------------------------Selective Kernel Attention End---------------------------
 
 
# ---------------------------ShuffleAttention Begin---------------------------
from torch.nn import init  # used by init_weights() here and by SpatialGroupEnhance below
from torch.nn.parameter import Parameter
 
 
class ShuffleAttention(nn.Module):
 
    def __init__(self, channel=512, G=8):
        super().__init__()
        self.G = G
        self.channel = channel
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.gn = nn.GroupNorm(channel // (2 * G), channel // (2 * G))
        self.cweight = Parameter(torch.zeros(1, channel // (2 * G), 1, 1))
        self.cbias = Parameter(torch.ones(1, channel // (2 * G), 1, 1))
        self.sweight = Parameter(torch.zeros(1, channel // (2 * G), 1, 1))
        self.sbias = Parameter(torch.ones(1, channel // (2 * G), 1, 1))
        self.sigmoid = nn.Sigmoid()
 
    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)
 
    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, -1, h, w)
        x = x.permute(0, 2, 1, 3, 4)
 
        # flatten
        x = x.reshape(b, -1, h, w)
 
        return x
 
    def forward(self, x):
        b, c, h, w = x.size()
        # group into subfeatures
        x = x.view(b * self.G, -1, h, w)  # bs*G,c//G,h,w
 
        # channel_split
        x_0, x_1 = x.chunk(2, dim=1)  # bs*G,c//(2*G),h,w
 
        # channel attention
        x_channel = self.avg_pool(x_0)  # bs*G,c//(2*G),1,1
        x_channel = self.cweight * x_channel + self.cbias  # bs*G,c//(2*G),1,1
        x_channel = x_0 * self.sigmoid(x_channel)
 
        # spatial attention
        x_spatial = self.gn(x_1)  # bs*G,c//(2*G),h,w
        x_spatial = self.sweight * x_spatial + self.sbias  # bs*G,c//(2*G),h,w
        x_spatial = x_1 * self.sigmoid(x_spatial)  # bs*G,c//(2*G),h,w
 
        # concatenate along channel axis
        out = torch.cat([x_channel, x_spatial], dim=1)  # bs*G,c//G,h,w
        out = out.contiguous().view(b, -1, h, w)
 
        # channel shuffle
        out = self.channel_shuffle(out, 2)
        return out
 
 
# ---------------------------ShuffleAttention End---------------------------
 
 
# ---------------------------A2-Net  Begin---------------------------
import torch.nn.functional as F  # needed for F.softmax if not already imported at the top of the file


class DoubleAttention(nn.Module):

    def __init__(self, in_channels, reconstruct=True):
        super().__init__()
        self.in_channels = in_channels
        c__ = int(in_channels * 0.5 * 0.5)  # hidden channels
        self.reconstruct = reconstruct
        self.c_m = c_m = c__
        self.c_n = c_n = c__
        self.convA = nn.Conv2d(in_channels, c_m, 1)
        self.convB = nn.Conv2d(in_channels, c_n, 1)
        self.convV = nn.Conv2d(in_channels, c_n, 1)
        if self.reconstruct:
            self.conv_reconstruct = nn.Conv2d(c_m, in_channels, kernel_size=1)
        self.init_weights()

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                torch.nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                torch.nn.init.constant_(m.weight, 1)
                torch.nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                torch.nn.init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)

    def forward(self, x):
        b, c, h, w = x.shape
        assert c == self.in_channels
        A = self.convA(x)  # b,c_m,h,w
        B = self.convB(x)  # b,c_n,h,w
        V = self.convV(x)  # b,c_n,h,w
        tmpA = A.view(b, self.c_m, -1)
        attention_maps = F.softmax(B.view(b, self.c_n, -1), dim=1)
        attention_vectors = F.softmax(V.view(b, self.c_n, -1), dim=1)
        # step 1: feature gating
        global_descriptors = torch.bmm(tmpA, attention_maps.permute(0, 2, 1))  # b.c_m,c_n
        # step 2: feature distribution
        tmpZ = global_descriptors.matmul(attention_vectors)  # b,c_m,h*w
        tmpZ = tmpZ.view(b, self.c_m, h, w)  # b,c_m,h,w
        if self.reconstruct:
            tmpZ = self.conv_reconstruct(tmpZ)

        return tmpZ
# ---------------------------A2-Net  End---------------------------
 
 

 
 
# ---------------------------RFB  Begin---------------------------
class BasicConv(nn.Module):
 
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0, dilation=1, groups=1, relu=True,
                 bn=True):
        super(BasicConv, self).__init__()
        self.out_channels = out_planes
        if bn:
            self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=padding,
                                  dilation=dilation, groups=groups, bias=False)
            self.bn = nn.BatchNorm2d(out_planes, eps=1e-5, momentum=0.01, affine=True)
            self.relu = nn.ReLU(inplace=True) if relu else None
        else:
            self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=padding,
                                  dilation=dilation, groups=groups, bias=True)
            self.bn = None
            self.relu = nn.ReLU(inplace=True) if relu else None
 
    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        if self.relu is not None:
            x = self.relu(x)
        return x
 
 
class BasicRFB(nn.Module):
 
    def __init__(self, in_planes, out_planes, stride=1, scale=0.1, map_reduce=8, vision=1, groups=1):
        super(BasicRFB, self).__init__()
        self.scale = scale
        self.out_channels = out_planes
        inter_planes = in_planes // map_reduce
 
        self.branch0 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1, groups=groups, relu=False),
            BasicConv(inter_planes, 2 * inter_planes, kernel_size=(3, 3), stride=stride, padding=(1, 1), groups=groups),
            BasicConv(2 * inter_planes, 2 * inter_planes, kernel_size=3, stride=1, padding=vision + 1,
                      dilation=vision + 1, relu=False, groups=groups)
        )
        self.branch1 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1, groups=groups, relu=False),
            BasicConv(inter_planes, 2 * inter_planes, kernel_size=(3, 3), stride=stride, padding=(1, 1), groups=groups),
            BasicConv(2 * inter_planes, 2 * inter_planes, kernel_size=3, stride=1, padding=vision + 2,
                      dilation=vision + 2, relu=False, groups=groups)
        )
        self.branch2 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1, groups=groups, relu=False),
            BasicConv(inter_planes, (inter_planes // 2) * 3, kernel_size=3, stride=1, padding=1, groups=groups),
            BasicConv((inter_planes // 2) * 3, 2 * inter_planes, kernel_size=3, stride=stride, padding=1,
                      groups=groups),
            BasicConv(2 * inter_planes, 2 * inter_planes, kernel_size=3, stride=1, padding=vision + 4,
                      dilation=vision + 4, relu=False, groups=groups)
        )
 
        self.ConvLinear = BasicConv(6 * inter_planes, out_planes, kernel_size=1, stride=1, relu=False)
        self.shortcut = BasicConv(in_planes, out_planes, kernel_size=1, stride=stride, relu=False)
        self.relu = nn.ReLU(inplace=False)
 
    def forward(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
 
        out = torch.cat((x0, x1, x2), 1)
        out = self.ConvLinear(out)
        short = self.shortcut(x)
        out = out * self.scale + short
        out = self.relu(out)
 
        return out
 
 
# ---------------------------RFB  End---------------------------
 
 

 
# ---------------------------CoTAttention Begin---------------------------
class CoTAttention(nn.Module):
 
    def __init__(self, dim=512, kernel_size=3):
        super().__init__()
        self.dim = dim
        self.kernel_size = kernel_size
 
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=kernel_size, padding=kernel_size // 2, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU()
        )
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim)
        )
 
        factor = 4
        self.attention_embed = nn.Sequential(
            nn.Conv2d(2 * dim, 2 * dim // factor, 1, bias=False),
            nn.BatchNorm2d(2 * dim // factor),
            nn.ReLU(),
            nn.Conv2d(2 * dim // factor, kernel_size * kernel_size * dim, 1)
        )
 
    def forward(self, x):
        bs, c, h, w = x.shape
        k1 = self.key_embed(x)  # bs,c,h,w
        v = self.value_embed(x).view(bs, c, -1)  # bs,c,h,w
 
        y = torch.cat([k1, x], dim=1)  # bs,2c,h,w
        att = self.attention_embed(y)  # bs,c*k*k,h,w
        att = att.reshape(bs, c, self.kernel_size * self.kernel_size, h, w)
        att = att.mean(2, keepdim=False).view(bs, c, -1)  # bs,c,h*w
        k2 = F.softmax(att, dim=-1) * v
        k2 = k2.view(bs, c, h, w)
 
        return k1 + k2
 
 
# ---------------------------CoTAttention End---------------------------
 
 
# ---------------------------EffectiveSEModule Begin---------------------------
from timm.models.layers.create_act import create_act_layer
 
 
class EffectiveSEModule(nn.Module):
    def __init__(self, channels, add_maxpool=False, gate_layer='hard_sigmoid'):
        super(EffectiveSEModule, self).__init__()
        self.add_maxpool = add_maxpool
        self.fc = nn.Conv2d(channels, channels, kernel_size=1, padding=0)
        self.gate = create_act_layer(gate_layer)
 
    def forward(self, x):
        x_se = x.mean((2, 3), keepdim=True)
        if self.add_maxpool:
            # experimental codepath, may remove or change
            x_se = 0.5 * x_se + 0.5 * x.amax((2, 3), keepdim=True)
        x_se = self.fc(x_se)
        return x * self.gate(x_se)
 
 
# ---------------------------EffectiveSEModule End---------------------------
 
 
# ---------------------------GlobalContext Begin---------------------------
from timm.models.layers.create_act import create_act_layer, get_act_layer
from timm.models.layers.helpers import make_divisible
from timm.models.layers.mlp import ConvMlp
from timm.models.layers.norm import LayerNorm2d
 
 
class GlobalContext(nn.Module):
 
    def __init__(self, channels, use_attn=True, fuse_add=False, fuse_scale=True, init_last_zero=False,
                 rd_ratio=1. / 8, rd_channels=None, rd_divisor=1, act_layer=nn.ReLU, gate_layer='sigmoid'):
        super(GlobalContext, self).__init__()
        act_layer = get_act_layer(act_layer)
 
        self.conv_attn = nn.Conv2d(channels, 1, kernel_size=1, bias=True) if use_attn else None
 
        if rd_channels is None:
            rd_channels = make_divisible(channels * rd_ratio, rd_divisor, round_limit=0.)
        if fuse_add:
            self.mlp_add = ConvMlp(channels, rd_channels, act_layer=act_layer, norm_layer=LayerNorm2d)
        else:
            self.mlp_add = None
        if fuse_scale:
            self.mlp_scale = ConvMlp(channels, rd_channels, act_layer=act_layer, norm_layer=LayerNorm2d)
        else:
            self.mlp_scale = None
 
        self.gate = create_act_layer(gate_layer)
        self.init_last_zero = init_last_zero
        self.reset_parameters()
 
    def reset_parameters(self):
        if self.conv_attn is not None:
            nn.init.kaiming_normal_(self.conv_attn.weight, mode='fan_in', nonlinearity='relu')
        if self.mlp_add is not None:
            nn.init.zeros_(self.mlp_add.fc2.weight)
 
    def forward(self, x):
        B, C, H, W = x.shape
 
        if self.conv_attn is not None:
            attn = self.conv_attn(x).reshape(B, 1, H * W)  # (B, 1, H * W)
            attn = F.softmax(attn, dim=-1).unsqueeze(3)  # (B, 1, H * W, 1)
            context = x.reshape(B, C, H * W).unsqueeze(1) @ attn
            context = context.view(B, C, 1, 1)
        else:
            context = x.mean(dim=(2, 3), keepdim=True)
 
        if self.mlp_scale is not None:
            mlp_x = self.mlp_scale(context)
            x = x * self.gate(mlp_x)
        if self.mlp_add is not None:
            mlp_x = self.mlp_add(context)
            x = x + mlp_x
 
        return x
 
 
# ---------------------------GlobalContext End---------------------------
 
# ---------------------------GatherExcite Begin---------------------------
from timm.models.layers.create_act import create_act_layer, get_act_layer
from timm.models.layers.create_conv2d import create_conv2d
from timm.models.layers.helpers import make_divisible
from timm.models.layers.mlp import ConvMlp
 
 
class GatherExcite(nn.Module):
    def __init__(
            self, channels, feat_size=None, extra_params=False, extent=0, use_mlp=True,
            rd_ratio=1. / 16, rd_channels=None, rd_divisor=1, add_maxpool=False,
            act_layer=nn.ReLU, norm_layer=nn.BatchNorm2d, gate_layer='sigmoid'):
        super(GatherExcite, self).__init__()
        self.add_maxpool = add_maxpool
        act_layer = get_act_layer(act_layer)
        self.extent = extent
        if extra_params:
            self.gather = nn.Sequential()
            if extent == 0:
                assert feat_size is not None, 'spatial feature size must be specified for global extent w/ params'
                self.gather.add_module(
                    'conv1', create_conv2d(channels, channels, kernel_size=feat_size, stride=1, depthwise=True))
                if norm_layer:
                    self.gather.add_module(f'norm1', nn.BatchNorm2d(channels))
            else:
                assert extent % 2 == 0
                num_conv = int(math.log2(extent))
                for i in range(num_conv):
                    self.gather.add_module(
                        f'conv{i + 1}',
                        create_conv2d(channels, channels, kernel_size=3, stride=2, depthwise=True))
                    if norm_layer:
                        self.gather.add_module(f'norm{i + 1}', nn.BatchNorm2d(channels))
                    if i != num_conv - 1:
                        self.gather.add_module(f'act{i + 1}', act_layer(inplace=True))
        else:
            self.gather = None
            if self.extent == 0:
                self.gk = 0
                self.gs = 0
            else:
                assert extent % 2 == 0
                self.gk = self.extent * 2 - 1
                self.gs = self.extent
 
        if not rd_channels:
            rd_channels = make_divisible(channels * rd_ratio, rd_divisor, round_limit=0.)
        self.mlp = ConvMlp(channels, rd_channels, act_layer=act_layer) if use_mlp else nn.Identity()
        self.gate = create_act_layer(gate_layer)
 
    def forward(self, x):
        size = x.shape[-2:]
        if self.gather is not None:
            x_ge = self.gather(x)
        else:
            if self.extent == 0:
                # global extent
                x_ge = x.mean(dim=(2, 3), keepdims=True)
                if self.add_maxpool:
                    # experimental codepath, may remove or change
                    x_ge = 0.5 * x_ge + 0.5 * x.amax((2, 3), keepdim=True)
            else:
                x_ge = F.avg_pool2d(
                    x, kernel_size=self.gk, stride=self.gs, padding=self.gk // 2, count_include_pad=False)
                if self.add_maxpool:
                    # experimental codepath, may remove or change
                    x_ge = 0.5 * x_ge + 0.5 * F.max_pool2d(x, kernel_size=self.gk, stride=self.gs, padding=self.gk // 2)
        x_ge = self.mlp(x_ge)
        if x_ge.shape[-1] != 1 or x_ge.shape[-2] != 1:
            x_ge = F.interpolate(x_ge, size=size)
        return x * self.gate(x_ge)
 
 
# ---------------------------GatherExcite End---------------------------
 
 
# ---------------------------MHSA Begin---------------------------
class MHSA(nn.Module):
    def __init__(self, n_dims, width=14, height=14, heads=4, pos_emb=False):
        super(MHSA, self).__init__()
 
        self.heads = heads
        self.query = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.key = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.value = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.pos = pos_emb
        if self.pos:
            self.rel_h_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, 1, int(height)]),
                                             requires_grad=True)
            self.rel_w_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, int(width), 1]),
                                             requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)
 
    def forward(self, x):
        n_batch, C, width, height = x.size()
        q = self.query(x).view(n_batch, self.heads, C // self.heads, -1)
        k = self.key(x).view(n_batch, self.heads, C // self.heads, -1)
        v = self.value(x).view(n_batch, self.heads, C // self.heads, -1)
        content_content = torch.matmul(q.permute(0, 1, 3, 2), k)  # 1,C,h*w,h*w
        c1, c2, c3, c4 = content_content.size()
        if self.pos:
            content_position = (self.rel_h_weight + self.rel_w_weight).view(1, self.heads, C // self.heads, -1).permute(
                0, 1, 3, 2)  # 1,4,1024,64
 
            content_position = torch.matmul(content_position, q)  # ([1, 4, 1024, 256])
            content_position = content_position if (
                    content_content.shape == content_position.shape) else content_position[:, :, :c3, ]
            assert (content_content.shape == content_position.shape)
            energy = content_content + content_position
        else:
            energy = content_content
        attention = self.softmax(energy)
        out = torch.matmul(v, attention.permute(0, 1, 3, 2))  # 1,4,256,64
        out = out.view(n_batch, C, width, height)
        return out
 
 
# ---------------------------MHSA End---------------------------
 
 
# ---------------------------ParNetAttention Begin---------------------------
class ParNetAttention(nn.Module):
 
    def __init__(self, channel=512):
        super().__init__()
        self.sse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channel, channel, kernel_size=1),
            nn.Sigmoid()
        )
 
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(channel, channel, kernel_size=1),
            nn.BatchNorm2d(channel)
        )
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(channel, channel, kernel_size=3, padding=1),
            nn.BatchNorm2d(channel)
        )
        self.silu = nn.SiLU()
 
    def forward(self, x):
        b, c, _, _ = x.size()
        x1 = self.conv1x1(x)
        x2 = self.conv3x3(x)
        x3 = self.sse(x) * x
        y = self.silu(x1 + x2 + x3)
        return y
 
 
# ---------------------------ParNetAttention End---------------------------
 
 
# ---------------------------ParallelPolarizedSelfAttention Begin---------------------------
class ParallelPolarizedSelfAttention(nn.Module):
 
    def __init__(self, channel=512):
        super().__init__()
        self.ch_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.ch_wq = nn.Conv2d(channel, 1, kernel_size=(1, 1))
        self.softmax_channel = nn.Softmax(1)
        self.softmax_spatial = nn.Softmax(-1)
        self.ch_wz = nn.Conv2d(channel // 2, channel, kernel_size=(1, 1))
        self.ln = nn.LayerNorm(channel)
        self.sigmoid = nn.Sigmoid()
        self.sp_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.sp_wq = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.agp = nn.AdaptiveAvgPool2d((1, 1))
 
    def forward(self, x):
        b, c, h, w = x.size()
 
        # Channel-only Self-Attention
        channel_wv = self.ch_wv(x)  # bs,c//2,h,w
        channel_wq = self.ch_wq(x)  # bs,1,h,w
        channel_wv = channel_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        channel_wq = channel_wq.reshape(b, -1, 1)  # bs,h*w,1
        channel_wq = self.softmax_channel(channel_wq)
        channel_wz = torch.matmul(channel_wv, channel_wq).unsqueeze(-1)  # bs,c//2,1,1
        channel_weight = self.sigmoid(self.ln(self.ch_wz(channel_wz).reshape(b, c, 1).permute(0, 2, 1))).permute(0, 2,
                                                                                                                 1).reshape(
            b, c, 1, 1)  # bs,c,1,1
        channel_out = channel_weight * x
 
        # Spatial-only Self-Attention
        spatial_wv = self.sp_wv(x)  # bs,c//2,h,w
        spatial_wq = self.sp_wq(x)  # bs,c//2,h,w
        spatial_wq = self.agp(spatial_wq)  # bs,c//2,1,1
        spatial_wv = spatial_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        spatial_wq = spatial_wq.permute(0, 2, 3, 1).reshape(b, 1, c // 2)  # bs,1,c//2
        spatial_wq = self.softmax_spatial(spatial_wq)
        spatial_wz = torch.matmul(spatial_wq, spatial_wv)  # bs,1,h*w
        spatial_weight = self.sigmoid(spatial_wz.reshape(b, 1, h, w))  # bs,1,h,w
        spatial_out = spatial_weight * x
        out = spatial_out + channel_out
        return out
 
 
# ---------------------------ParallelPolarizedSelfAttention End---------------------------
 
# ---------------------------SpatialGroupEnhance Begin---------------------------
class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups=8):
        super().__init__()
        self.groups = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.sig = nn.Sigmoid()
        self.init_weights()
 
    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)
 
    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, -1, h, w)  # bs*g,dim//g,h,w
        xn = x * self.avg_pool(x)  # bs*g,dim//g,h,w
        xn = xn.sum(dim=1, keepdim=True)  # bs*g,1,h,w
        t = xn.view(b * self.groups, -1)  # bs*g,h*w
 
        t = t - t.mean(dim=1, keepdim=True)  # bs*g,h*w
        std = t.std(dim=1, keepdim=True) + 1e-5
        t = t / std  # bs*g,h*w
        t = t.view(b, self.groups, h, w)  # bs,g,h,w

        t = t * self.weight + self.bias  # bs,g,h,w
        t = t.view(b * self.groups, 1, h, w)  # bs*g,1,h,w
        x = x * self.sig(t)
        x = x.view(b, c, h, w)
        return x
 
 
# ---------------------------SpatialGroupEnhance End---------------------------
 
 
# ---------------------------SequentialPolarizedSelfAttention Begin---------------------------
class SequentialPolarizedSelfAttention(nn.Module):
 
    def __init__(self, channel=512):
        super().__init__()
        self.ch_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.ch_wq = nn.Conv2d(channel, 1, kernel_size=(1, 1))
        self.softmax_channel = nn.Softmax(1)
        self.softmax_spatial = nn.Softmax(-1)
        self.ch_wz = nn.Conv2d(channel // 2, channel, kernel_size=(1, 1))
        self.ln = nn.LayerNorm(channel)
        self.sigmoid = nn.Sigmoid()
        self.sp_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.sp_wq = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.agp = nn.AdaptiveAvgPool2d((1, 1))
 
    def forward(self, x):
        b, c, h, w = x.size()
 
        # Channel-only Self-Attention
        channel_wv = self.ch_wv(x)  # bs,c//2,h,w
        channel_wq = self.ch_wq(x)  # bs,1,h,w
        channel_wv = channel_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        channel_wq = channel_wq.reshape(b, -1, 1)  # bs,h*w,1
        channel_wq = self.softmax_channel(channel_wq)
        channel_wz = torch.matmul(channel_wv, channel_wq).unsqueeze(-1)  # bs,c//2,1,1
        channel_weight = self.sigmoid(self.ln(self.ch_wz(channel_wz).reshape(b, c, 1).permute(0, 2, 1))).permute(0, 2,
                                                                                                                 1).reshape(
            b, c, 1, 1)  # bs,c,1,1
        channel_out = channel_weight * x
 
        # Spatial-only Self-Attention
        spatial_wv = self.sp_wv(channel_out)  # bs,c//2,h,w
        spatial_wq = self.sp_wq(channel_out)  # bs,c//2,h,w
        spatial_wq = self.agp(spatial_wq)  # bs,c//2,1,1
        spatial_wv = spatial_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        spatial_wq = spatial_wq.permute(0, 2, 3, 1).reshape(b, 1, c // 2)  # bs,1,c//2
        spatial_wq = self.softmax_spatial(spatial_wq)
        spatial_wz = torch.matmul(spatial_wq, spatial_wv)  # bs,1,h*w
        spatial_weight = self.sigmoid(spatial_wz.reshape(b, 1, h, w))  # bs,1,h,w
        spatial_out = spatial_weight * channel_out
        return spatial_out
 
 
# ---------------------------SequentialPolarizedSelfAttention End---------------------------
 
 
# ---------------------------TripletAttention Begin---------------------------
class BasicConv_T(nn.Module):
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0, dilation=1, groups=1, relu=True,
                 bn=True, bias=False):
        super(BasicConv_T, self).__init__()
        self.out_channels = out_planes
        self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=padding,
                              dilation=dilation, groups=groups, bias=bias)
        self.bn = nn.BatchNorm2d(out_planes, eps=1e-5, momentum=0.01, affine=True) if bn else None
        self.relu = nn.ReLU() if relu else None
 
    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        if self.relu is not None:
            x = self.relu(x)
        return x
 
 
class ZPool(nn.Module):
    def forward(self, x):
        return torch.cat((torch.max(x, 1)[0].unsqueeze(1), torch.mean(x, 1).unsqueeze(1)), dim=1)
 
 
class AttentionGate(nn.Module):
    def __init__(self):
        super(AttentionGate, self).__init__()
        kernel_size = 7
        self.compress = ZPool()
        self.conv = BasicConv_T(2, 1, kernel_size, stride=1, padding=(kernel_size - 1) // 2, relu=False)
 
    def forward(self, x):
        x_compress = self.compress(x)
        x_out = self.conv(x_compress)
        scale = torch.sigmoid_(x_out)
        return x * scale
 
 
class TripletAttention(nn.Module):
    def __init__(self, no_spatial=False):
        super(TripletAttention, self).__init__()
        self.cw = AttentionGate()
        self.hc = AttentionGate()
        self.no_spatial = no_spatial
        if not no_spatial:
            self.hw = AttentionGate()
 
    def forward(self, x):
        x_perm1 = x.permute(0, 2, 1, 3).contiguous()
        x_out1 = self.cw(x_perm1)
        x_out11 = x_out1.permute(0, 2, 1, 3).contiguous()
        x_perm2 = x.permute(0, 3, 2, 1).contiguous()
        x_out2 = self.hc(x_perm2)
        x_out21 = x_out2.permute(0, 3, 2, 1).contiguous()
        if not self.no_spatial:
            x_out = self.hw(x)
            x_out = 1 / 3 * (x_out + x_out11 + x_out21)
        else:
            x_out = 1 / 2 * (x_out11 + x_out21)
        return x_out
 
# ---------------------------TripletAttention End---------------------------
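
All of the modules above are designed as drop-in, shape-preserving layers. Below is a minimal smoke test (a sketch, not part of the downloadable code: it assumes the classes are in scope, e.g. in a Python session started from the repository root after from models.common import *, and the 2x64x32x32 input and the hyper-parameters are arbitrary) to confirm that a module keeps the (B, C, H, W) shape before wiring it into a config:

# Minimal shape sanity check for a few of the plug-in attention modules above (sketch).
import torch

x = torch.randn(2, 64, 32, 32)            # (batch, channels, height, width), arbitrary sizes
modules = [
    NAMAttention(64),                     # channel-only argument
    CrissCrossAttention(64),
    SKAttention(channel=64, reduction=8),
    ShuffleAttention(channel=64, G=8),
    DoubleAttention(64),
    TripletAttention(),                   # needs no channel argument at all
]
for m in modules:
    y = m(x)
    # every module is expected to return a tensor with the same (B, C, H, W) shape
    assert y.shape == x.shape, f'{type(m).__name__}: {y.shape} != {x.shape}'
    print(f'{type(m).__name__:>20s}  ok  {tuple(y.shape)}')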

yolo.py

Add the following elif branches to parse_model() in models/yolo.py, directly below the existing nn.BatchNorm2d branch (a quick build check is sketched after the code):

(image: screenshot of the insertion point inside parse_model)

        # by CSDN 迪菲赫尔曼 (do not redistribute; for learning and exchange only)
        # ------------Attention ↓------------
        elif m in [SimAM, ECA, SpatialGroupEnhance,
                   TripletAttention]:  # modules that need no channel argument
            args = [*args[:]]
        elif m in [CoordAtt, GAMAttention]:  # modules with an explicit output-channel argument (scaled by width_multiple)
            c1, c2 = ch[f], args[0]
            if c2 != no:
                c2 = make_divisible(c2 * gw, 8)
            args = [c1, c2, *args[1:]]
        elif m in [SE, ShuffleAttention, CBAM, SKAttention, DoubleAttention, CoTAttention, EffectiveSEModule,
                   GlobalContext, GatherExcite, MHSA]:  # modules taking the input channels plus their own extra args
            c1 = ch[f]
            args = [c1, *args[0:]]
        elif m in [S2Attention, NAMAttention, CrissCrossAttention, SequentialPolarizedSelfAttention,
                   ParallelPolarizedSelfAttention, ParNetAttention]:  # modules whose only argument is the input channels
            c1 = ch[f]
            args = [c1]
        # ------------Attention ↑------------
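
A quick way to confirm that the new branches are picked up (the attention classes defined in models/common.py become visible here through yolo.py's existing from models.common import *) is to build a model from a modified config and run a dummy forward pass. This is only a sketch: models/yolov5s-att.yaml is an example name for one of the templates below with a real attention entry filled in.

# Sanity check for the modified parse_model(): build a model from an edited config (sketch).
# Assumption: the edited template was saved as models/yolov5s-att.yaml (example file name).
import torch
from models.yolo import Model

if __name__ == '__main__':
    model = Model('models/yolov5s-att.yaml', ch=3, nc=80).eval()
    with torch.no_grad():
        preds = model(torch.zeros(1, 3, 640, 640))    # dummy 640x640 image, batch of 1
    print('model built and forward pass succeeded')

Running python models/yolo.py --cfg models/yolov5s-att.yaml from the repository root does much the same thing and also prints the parsed layer table.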

Attention configuration cheat sheet

args.yaml
# by CSDN 迪菲赫尔曼
# do not redistribute; for learning and exchange only

SE:  # https://arxiv.org/abs/1709.01507
  - [-1, 1, SE, [16]],

CBAM:  # https://openaccess.thecvf.com/content_ECCV_2018/papers/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.pdf
  - [-1, 1, CBAM, [7]],

BAM:  # https://arxiv.org/pdf/1807.06514.pdf
  # [-1, 1, BAMBlock, [16, 1]], bug

ECA:  # https://arxiv.org/pdf/1910.03151.pdf
  - [-1, 1, ECA, [3]],

SimAM:  # http://proceedings.mlr.press/v139/yang21o/yang21o.pdf
  - [-1, 1, SimAM, [1e-4]],

SKAttention:  # https://arxiv.org/pdf/1903.06586.pdf
  - [-1, 1, SKAttention, [[1, 3, 5, 7], 16, 1, 32]],

ShuffleAttention:  # https://arxiv.org/pdf/2102.00240.pdf
  - [-1, 1, ShuffleAttention, [8]],

DoubleAttention:  # https://arxiv.org/pdf/1810.11579.pdf
  - [-1, 1, DoubleAttention, [True]],

CoTAttention:  # https://arxiv.org/abs/2107.12292
  - [-1, 1, CoTAttention, [3]],

EffectiveSE:   # https://arxiv.org/abs/1911.06667
  - [-1, 1, EffectiveSEModule, [False,'hard_sigmoid']],

GlobalContext:  # https://arxiv.org/abs/1904.11492
  - [-1, 1, GlobalContext, [True, False, True, False, 1./8, None, 1, nn.ReLU, 'sigmoid']],

GatherExcite:  # https://arxiv.org/abs/1810.12348
  - [-1, 1, GatherExcite, [None, False, 0, True, 1./16, None, 1, False, nn.ReLU, nn.BatchNorm2d, 'sigmoid']],

MHSA:    # https://arxiv.org/abs/2101.11605
  - [-1, 1, MHSA, [14, 14, 4, False]],

TripletAttention:   # https://arxiv.org/abs/2010.03045
  - [-1, 1, TripletAttention, [False]],

SpatialGroupEnhance:  # https://arxiv.org/pdf/1905.09646.pdf
  - [-1, 1, SpatialGroupEnhance, [8]],


# -----------↓ channel = output channels of the layer above the insertion point -----------
NAMAttention:   # https://arxiv.org/abs/2111.12419
  - [-1, 1, NAMAttention, [channel]],

ParNetAttention:  # https://arxiv.org/abs/2110.07641
  - [-1, 1, ParNetAttention, [channel]],

S2Attention:  # https://arxiv.org/abs/2108.01072
  - [-1, 1, S2Attention, [channel]],

CrissCrossAttention:  # https://arxiv.org/abs/1811.11721
  - [-1, 1, CrissCrossAttention, [channel]],

CA:   # https://arxiv.org/abs/2103.02907
  - [-1, 1, CoordAtt, [channel, 32]],

GAMAttention:   # https://arxiv.org/pdf/2112.05561v1.pdf
  - [-1, 1, GAMAttention, [channel, True, 4]],

PolarizedSelfAttention:  # https://arxiv.org/abs/2107.00782
  - [-1, 1, ParallelPolarizedSelfAttention, [channel]],

SequentialPolarizedSelfAttention:  # https://arxiv.org/abs/2107.00782
  - [-1, 1, SequentialPolarizedSelfAttention, [channel]],
# -----------↑ channel = output channels of the layer above the insertion point -----------

How do I use the cheat sheet?

Taking template 3 as an example: to add the SE module, copy - [-1, 1, SE, [16]], from args.yaml and use it to replace the template's [-1, 1, Attention, [args]], line. For these entries no channel bookkeeping is needed.

NAMAttention, S2Attention, CrissCrossAttention, CA, GAMAttention, PolarizedSelfAttention and SequentialPolarizedSelfAttention do need channel information: replace channel with the number of output channels of the layer directly above the insertion point. A concrete substitution is sketched below.
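
For instance, here is a sketch of the P3-level placeholder of the Backbone template filled in two ways (the neighbouring lines are repeated for context; 256 is simply the nominal output-channel value of the C3 [256] layer above it, and parse_model applies width_multiple scaling on its own):

   [-1, 6, C3, [256]],
   [-1, 1, SE, [16]],                         # option 1: no channel information needed
#  [-1, 1, GAMAttention, [256, True, 4]],     # option 2: channel = 256, taken from the C3 [256] layer above
   [-1, 1, Conv, [512, 3, 2]],    # 6-P4/16

Only one of the two attention lines should be active at a time, so that the layer indices used later in the head (and by Detect) stay correct.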


YOLOv5 templates

yolov5-template-Backbone

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone + three Attention modules
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],    # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],    # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Attention, [args]],    # ---> You can add your attention module name here
   [-1, 1, Conv, [512, 3, 2]],    # 6-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Attention, [args]],    # ---> You can add your attention module name here
   [-1, 1, Conv, [1024, 3, 2]],   # 9-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],      # 11
   [-1, 1, Attention, [args]],    # ---> You can add your attention module name here
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 8], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 16

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 5], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 20 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 17], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 23 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 13], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 26 (P5/32-large)

   [[20, 23, 26], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

yolov5-template-Neck

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024,5]],     # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
   [-1, 1, Attention, [args]],    # 18  ---> You can add your attention module name here

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]], # cat head P4
   [-1, 3, C3, [512, False]],  # 21 (P4/16-medium)
   [-1, 1, Attention, [args]],    # 22  ---> You can add your attention module name here

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]], # 24 cat head P5
   [-1, 3, C3, [1024, False]], # 25 (P5/32-large)
   [-1, 1, Attention, [args]],  # 26   ---> You can add your attention module name here

   [[18, 22, 26], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

yolov5-template-SPP

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],    # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],    # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],    # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],   # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, Attention, [args]],    # ---> You can add your attention module name here
   [-1, 1, SPPF, [1024, 5]],      # 10
  ]

# YOLOv5 v6.1 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 14

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 18 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 15], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 21 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 11], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 24 (P5/32-large)

   [[18, 21, 24], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
