An Accessible Deep Dive into ResNet

浩瀚之水_csdn

Last modified: 2024-09-13 11:17:34

First published: 2024-09-13 09:52:14

Article link: https://blog.csdn.net/a8039974/article/details/142202414

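ResNet (Residual Network) is a deep neural network architecture proposed by researchers at Microsoft Research Asia. Its core idea is the residual connection, which mitigates the degradation and vanishing/exploding gradient problems that appear when training deep networks, so that much deeper networks can be trained and reach higher accuracy. The sections below cover ResNet's principles, features, strengths and weaknesses, applications, and a PyTorch implementation.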

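1. How ResNet Works

The core idea of ResNet is to pass the input signal directly to later layers through residual connections, so that each block learns a residual rather than the complete mapping. Concretely, every residual block contains a cross-layer connection (also called a skip connection or shortcut connection) that adds the block's input directly to the block's output. During backpropagation this shortcut gives gradients a direct path to earlier layers, which alleviates the vanishing gradient problem in deep networks.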
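
As a minimal sketch of this idea, the following plain-Python example works on lists of floats rather than tensors. The helper names `residual_block`, `zero_branch`, and `small_branch` are made up for illustration and do not come from any library. A residual block computes `F(x) + x`, so when the residual branch outputs zeros the block reduces to the identity:

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x, where `branch` plays the role of F."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    """A residual branch that contributes nothing: F(x) = 0."""
    return [0.0 for _ in x]


def small_branch(x: Vector) -> Vector:
    """A hypothetical residual branch that applies a small correction."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)    # same values as x
corrected_out = residual_block(x, small_branch)  # x plus a 10% correction
```

The point of the design is that "do nothing" is the easy default: the block only has to learn the correction on top of its input.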

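1.1 The Degradation Problem in Deep Learning

In deep learning, degradation refers to the situation where network performance gets worse, not better, as more layers are added. This is counterintuitive, because deeper networks are generally expected to learn richer feature representations and therefore perform better. In practice, however, very deep networks can run into vanishing gradients, exploding gradients, overfitting, and loss of low-level features, all of which can hurt performance.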

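Causes of degradation

Gradient problems: As the number of layers grows, gradients may shrink (vanishing gradients) or grow (exploding gradients) during backpropagation, so the parameters cannot be updated effectively and model performance suffers.
Overfitting: Deep networks have strong representational capacity and can easily overfit the training data, which lowers test-set performance. Note that the degradation reported in the ResNet paper also shows up as higher training error, so overfitting alone does not explain it.
Loss of low-level features: A very deep network may concentrate on learning and expressing high-level features and neglect low-level ones, losing part of its useful representational ability.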

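Solutions to degradation

Researchers have proposed a number of remedies for degradation, the most representative being the residual network (ResNet) and its identity mapping mechanism.

Residual network (ResNet): By introducing residual structures and identity mappings, ResNet lets a network maintain or improve its performance as it gets deeper. The identity mapping in a residual block lets the input pass straight to the output, where it is added to the features produced by the convolutional layers. This eases vanishing and exploding gradients and helps preserve low-level features.
Parameter initialization and normalization: Suitable initialization and normalization improve stability and convergence speed. For example, drawing initial weights from an appropriately scaled Gaussian distribution and combining this with Batch Normalization makes training more stable and degradation less likely.
Network architecture design: Carefully designed deeper architectures that introduce more nonlinear transformations and feature interactions can raise a model's expressive power and accuracy. Complexity still has to be kept in check to avoid overfitting and excessive computational cost.
Data augmentation and regularization: These techniques help prevent overfitting and improve generalization. Applied appropriately, they reduce the model's over-reliance on the training data and improve performance and robustness.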

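In summary, degradation is a complex and important problem in deep learning, and researchers continue to look for more effective solutions. As the field matures, degradation can be expected to be handled better, and the performance and applicability of deep networks should continue to improve.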

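1.2 Vanishing Gradients

Vanishing gradients are a common problem in neural network training, especially in deep networks. During backpropagation, the gradients used to update the weights become so small that they barely change the weights at all. This typically happens in the lower layers of a deep network, that is, the layers closest to the input.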

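Causes

Deep network structure: In a deep network, gradients have to be backpropagated through many layers. Because of the chain rule, the gradient is a product of per-layer factors, so with many layers it can shrink step by step until it is very small or close to zero.
Choice of activation function: The gradient of some activation functions (such as sigmoid) becomes very small when the input is far from the function's center, which also causes vanishing gradients. The problem is more pronounced when such activations are used in deep networks.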
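
To make the chain-rule argument concrete, here is a small sketch in plain Python. It rests on simplifying assumptions that are not part of the text above: every layer is a scalar sigmoid unit, all weights are fixed to 1, and every layer sees the same pre-activation value. The function names are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(pre_activation: float, depth: int) -> float:
    """Product of `depth` identical sigmoid derivatives (weights assumed to be 1)."""
    return sigmoid_grad(pre_activation) ** depth


best_case_20_layers = chained_gradient(0.0, 20)  # 0.25 ** 20, already tiny
saturated_20_layers = chained_gradient(4.0, 20)  # far from the center: smaller still
```

Even in the best case the factor per layer is at most 0.25, so twenty layers already scale the gradient down by roughly twelve orders of magnitude.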

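Impact

Vanishing gradients prevent the network from updating its weights effectively during training, which hurts performance. In particular, the weights of the lower layers may never be trained adequately because the gradients they receive are so small. The network then fails to learn useful feature representations, which degrades the final predictions.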

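Solutions

Several methods have been proposed to address vanishing gradients:

Changing the activation function: Use activations whose gradients vanish less easily, such as ReLU (Rectified Linear Unit) and its variants (Leaky ReLU, PReLU, and so on). For positive inputs their gradient is constant (1 for ReLU), which helps avoid vanishing gradients.
Residual connections: A residual connection adds a block's input directly to that block's output, forming a residual block. This lets gradients bypass some layers during backpropagation, which eases vanishing gradients.
Batch Normalization: Batch Normalization normalizes the input of each layer so that its distribution stays within a stable range. This reduces vanishing gradients and speeds up training.
Choice of optimizer: A suitable optimizer (such as Adam or RMSprop) can also help to some extent. These optimizers adapt the effective learning rate to the magnitude of the gradients, which partly compensates for very small or very large gradients.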
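
The effect of a residual connection on the gradient can be sketched with scalars. For a plain layer `y = f(x)` the local derivative is `f'(x)`, whereas for a residual block `y = x + f(x)` it is `1 + f'(x)`. The example below assumes a toy setting in which every layer has the same local derivative of 0.1; the function names and the numbers are chosen purely for illustration.

```python
def plain_chain_gradient(local_grads):
    """Gradient through stacked plain layers: product of the local derivatives."""
    result = 1.0
    for g in local_grads:
        result *= g
    return result


def residual_chain_gradient(local_grads):
    """Gradient through stacked scalar residual blocks y = x + f(x).

    Each block contributes a factor (1 + f'(x)) instead of f'(x).
    """
    result = 1.0
    for g in local_grads:
        result *= 1.0 + g
    return result


# Assumed toy setting: 30 layers whose branch derivative is 0.1 each.
local_grads = [0.1] * 30
plain = plain_chain_gradient(local_grads)        # 0.1 ** 30, effectively zero
residual = residual_chain_gradient(local_grads)  # 1.1 ** 30, does not vanish
```

In a real network the branch derivatives vary in sign and size, so the residual product does not simply grow; the sketch only shows that the added 1 keeps the gradient from collapsing to zero.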

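In short, vanishing gradients need particular attention when training deep neural networks. Choosing a suitable activation function, adding residual connections, applying Batch Normalization, and picking an appropriate optimizer all help to ease the problem and improve training efficiency and performance.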

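1.3 Residual Learning

Residual learning is an important technique in deep learning. Through residual networks (ResNet), it addresses the vanishing and exploding gradient problems that conventional deep networks run into during training, and so improves training results and performance. The details follow.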

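The concept of residual learning

Residual learning means introducing residual blocks into a network so that the network learns the residual between input and output instead of learning the complete output directly. A residual block uses a skip connection to route the input straight to the output, where it is added to the features computed by the block's layers. This forms the basic structure of residual learning.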

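How residual learning works

Addressing vanishing and exploding gradients:
- In a conventional deep network, gradients may shrink (vanish) or grow (explode) during backpropagation as the number of layers increases, which makes the network hard to train.
- With skip connections, gradients can flow directly from deep layers to shallow ones, which largely avoids their decay or blow-up along the way and eases both problems.
Improving network performance:
- A residual network is built by stacking many residual blocks. Each block learns the residual between its input and output, which helps the network capture the feature information in the data.
- Residual learning lets a network maintain or improve its performance as it gets deeper, avoiding the degradation seen in conventional deep networks.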
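
The statement that gradients can flow directly from deep layers to shallow ones can be made concrete with a short derivation. It is a sketch under simplifying assumptions that the text above does not state: every shortcut is an identity mapping and no activation is applied after the addition, so block $l$ computes $x_{l+1} = x_l + F(x_l, W_l)$. The symbols $x_l$, $W_l$, $L$ (a deeper layer) and the loss $\mathcal{E}$ are introduced here for illustration.

$$
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
$$

The constant term 1 means that part of the gradient at layer $L$ reaches layer $l$ unchanged, regardless of how small the derivatives of the residual branches are.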

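Structure of a residual network

A residual network is a stack of residual blocks. Each block usually contains several convolutional layers, Batch Normalization layers, and activation layers. The heart of the block is the skip connection, which routes the input straight to the output and adds it to the features produced by the convolutional layers. This keeps information flowing through the network during training and eases vanishing and exploding gradients.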

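Applications of residual learning

Residual learning is widely used in computer vision, natural language processing, speech recognition, and other fields. In computer vision, residual networks have achieved excellent results in image classification, object detection, face recognition, and similar tasks. In natural language processing, residual learning is also used for text classification, sentiment analysis, machine translation, and related tasks.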

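Outlook

As deep learning develops, residual learning will spread to more fields. Researchers will keep exploring ways to improve residual networks, such as better residual block designs, tuning network width and depth, and new regularization and normalization strategies, in order to further improve performance and generalization. With deep learning being applied ever more widely, residual learning will play an important role in more practical scenarios.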

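1.4 Identity Mapping

The identity mapping, also called the identity function, is an important concept in mathematics, particularly in set theory and the theory of functions. The details follow.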

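Definition

For any set A, if the mapping f: A→A is defined by f(a)=a, so that every element a of A corresponds to itself, then f is called the identity mapping on A. Put simply, the identity mapping is the mapping whose preimage and image are exactly the same.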

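Properties

Uniqueness: Every set A has exactly one identity mapping.
Bijectivity: The identity mapping is a bijection. It is injective (distinct elements map to distinct images) and surjective (every element is the image of some element).
Linearity: On the real numbers R, or more generally on a vector space, the identity mapping is linear. On R its graph is a straight line through the origin with slope 1.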

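Applications

Baseline task: The identity mapping can serve as a simple baseline task for evaluating and analyzing important properties of a network. For example, when a network is asked to learn a mapping close to the identity, comparing its output with its input gives a direct measure of how well it performs.
Residual networks: The identity mapping plays a key role in residual networks (ResNet). Through the skip connection, the output of an earlier layer is added directly to the output of a later layer, which eases vanishing gradients in deep networks and makes complex mappings easier to learn.
Function analysis: The identity mapping is a useful tool for understanding the properties of functions. Comparing a function with the identity mapping reveals key information such as its nonlinearity and growth rate.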

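Example

On the real numbers R, the identity mapping can be written as y=f(x)=x. It is a very simple function, but an important one: every real number x is mapped to itself, y=x. Its graph is a straight line through the origin with slope 1.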

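Conclusion

The identity mapping is a basic mathematical concept with the properties of uniqueness, bijectivity, and linearity. It plays an important role in deep learning, function analysis, and other areas, and understanding it well makes it easier to apply in practice.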

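1.5 Mappings in a Residual Block

A residual block combines two mappings, the identity mapping and the residual mapping, which together give $x\rightarrow y=F(x)+x$

The identity mapping is realized by the skip connection.

The residual mapping is the residual $y-x$, that is, $F(x)$

1.6 Solid-Line Residual Blocks

The left block is used in shallower networks (e.g. ResNet34); the right block is used in deeper networks (e.g. ResNet101).

[Figure: left, solid-line residual block for shallow networks; right, solid-line residual block for deep networks]
The right-hand block reduces the number of parameters and the amount of computation. Its 1×1 convolution kernels reduce the channel dimension before the 3×3 convolution and restore it afterwards.

For an input feature map with 256 channels, the left block (two 3×3 convolutions with 256 channels each) needs 1,179,648 parameters, while the right block (1×1 down to 64 channels, 3×3 with 64 channels, 1×1 back to 256 channels) needs 69,632 parameters. Biases are not counted.

1.7 Dashed-Line Residual Blocks

The left block is used in shallower networks (e.g. ResNet34); the right block is used in deeper networks (e.g. ResNet101).

[Figure: left, dashed-line residual block for shallow networks; right, dashed-line residual block for deep networks]
The dashed-line residual structure adds a 1×1 convolution on the skip-connection branch so that the shortcut output matches the dimensions of the main-branch output.

Note how the stride of each layer in the dashed-line block differs from the solid-line block.

In the original paper, on the main branch of the right-hand dashed-line block, the first layer (1×1 convolution) has stride 2 and the second layer (3×3 convolution) has stride 1.
In the official PyTorch implementation, the first layer (1×1 convolution) has stride 1 and the second layer (3×3 convolution) has stride 2, which improves ImageNet top-1 accuracy by roughly 0.5%.

Differences:

The dashed-line residual structure (the first block of conv3_x, conv4_x, and conv5_x) changes the height, width, and depth of the feature map.
The solid-line residual structure has input and output feature maps of the same dimensions, so they can be added directly.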
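
The parameter comparison between the two solid-line blocks can be worked out with a few lines of arithmetic. The sketch below assumes a 256-channel input, ignores biases and BatchNorm parameters, and assumes the bottleneck reduces the channel count by a factor of 4; the function names are illustrative.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weight count of a 2D convolution without bias."""
    return kernel_size * kernel_size * in_channels * out_channels


def basic_block_params(channels: int) -> int:
    """Two 3x3 convolutions that keep the channel count."""
    return 2 * conv_params(channels, channels, 3)


def bottleneck_params(channels: int, reduction: int = 4) -> int:
    """1x1 reduce -> 3x3 -> 1x1 restore, with `reduction` assumed to be 4."""
    mid = channels // reduction
    return (conv_params(channels, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, channels, 1))


basic = basic_block_params(256)      # 2 * 3*3*256*256 = 1179648
bottleneck = bottleneck_params(256)  # 16384 + 36864 + 16384 = 69632
```

Under these assumptions the bottleneck needs roughly 17 times fewer weights for the same input and output width.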

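2. ResNet Features

Residual connections: ResNet passes information directly through residual connections, which helps keep features from fading layer by layer.
Depth: ResNet makes very deep architectures trainable, so accuracy can be improved by adding depth.
Residual blocks: The basic building unit of ResNet is the residual block. Each block contains several convolutional layers and one cross-layer connection.
Efficiency: Through the design of its residual blocks (in particular the bottleneck block), ResNet reduces computation while preserving accuracy.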

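3. ResNet Strengths and Weaknesses

Strengths

Very deep networks can be trained: Residual connections ease the vanishing gradient problem in deep networks, so networks can be trained at much greater depth.
Better expressive power and performance: ResNet accuracy generally improves as depth increases, and it has achieved excellent results on many computer vision tasks.
Vanishing and exploding gradients are mitigated: Residual connections make it easier for gradients to propagate during backpropagation.
More stable training: Residual connections make learning smoother and more stable, which helps accuracy and generalization.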

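Weaknesses

Heavy computational requirements: As depth grows, training and inference with ResNet need more computing resources.
Risk of overfitting: In some cases ResNet overfits, and regularization or similar measures are needed.
Redundancy: Some studies indicate that deep residual networks contain many redundant layers that contribute little to performance.
Receptive field: Stacking many layers increases the theoretical receptive field, but the effective receptive field in practice may be smaller than expected.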

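4. PyTorch Implementation

4.1 Defining the residual structures

The residual structures of ResNet-18/34 and ResNet-50/101/152 are different.

The first block of conv3_x, conv4_x, and conv5_x is always a dashed-line residual structure.
stride=1 corresponds to the solid-line residual structure; the convolution does not change the height and width of the feature map.
stride=2 corresponds to the dashed-line residual structure; the height and width are reduced to half of their original size.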
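
The effect of the stride on height and width follows from the standard output-size formula for a convolution without dilation, floor((n + 2p - k) / s) + 1. The helper below is an illustrative sketch; the 56×56 input size is an assumed example value.

```python
def conv_output_size(size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# 3x3 convolution with padding 1, as used on the main branch
same_size = conv_output_size(56, kernel_size=3, stride=1, padding=1)  # 56
halved = conv_output_size(56, kernel_size=3, stride=2, padding=1)     # 28

# 1x1 convolution with stride 2, as used on the dashed-line shortcut
shortcut = conv_output_size(56, kernel_size=1, stride=2, padding=0)   # 28
```

With stride 2, the main branch and the 1×1 shortcut both produce 28×28 maps, which is what allows the two to be added.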

ResNet-18/34 (BasicBlock):

import torch
import torch.nn as nn


# Residual structure for ResNet-18/34: BasicBlock
class BasicBlock(nn.Module):
    # Whether the number of kernels changes along the main branch.
    # In the shallow networks, the first and second convolutions of each block
    # use the same number of kernels, so the factor is 1.
    expansion = 1

    # Constructor
    # in_channel: depth of the input feature map
    # out_channel: depth of the output feature map (number of kernels on the main branch)
    # downsample: the 1x1 convolution on the shortcut of a dashed-line residual structure
    def __init__(self, in_channel, out_channel, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                               kernel_size=3, stride=stride, padding=1, bias=False)  # no bias when followed by BN
        self.bn1 = nn.BatchNorm2d(out_channel)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=1, padding=1, bias=False)  # stride is 1 here for both solid- and dashed-line blocks
        self.bn2 = nn.BatchNorm2d(out_channel)
        self.downsample = downsample   # None by default

    # Forward pass
    def forward(self, x):
        identity = x   # output of the shortcut branch
        if self.downsample is not None:   # dashed-line residual structure
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)   # no ReLU here; it is applied after the addition

        out += identity
        out = self.relu(out)

        return out

ResNet-50/101/152 (Bottleneck):

# Residual structure for ResNet-50/101/152: Bottleneck
class Bottleneck(nn.Module):
    """
    Note: in the original paper, on the main branch of the dashed-line residual structure,
    the first 1x1 convolution has stride 2 and the 3x3 convolution has stride 1.
    The official PyTorch implementation uses stride 1 for the first 1x1 convolution
    and stride 2 for the 3x3 convolution, which improves top-1 accuracy by roughly 0.5%.
    See ResNet v1.5: https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch
    """
    expansion = 4    # the third layer has four times as many kernels as the first and second

    def __init__(self, in_channel, out_channel, stride=1, downsample=None):
        super(Bottleneck, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,   # out_channel: kernels in layers one and two
                               kernel_size=1, stride=1, bias=False)  # squeeze channels; height and width unchanged
        self.bn1 = nn.BatchNorm2d(out_channel)
        # -----------------------------------------
        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=stride, bias=False, padding=1)   # stride 1 for solid-line, 2 for dashed-line
        self.bn2 = nn.BatchNorm2d(out_channel)
        # -----------------------------------------
        self.conv3 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel*self.expansion,    # four times as many kernels
                               kernel_size=1, stride=1, bias=False)  # unsqueeze channels
        self.bn3 = nn.BatchNorm2d(out_channel*self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    # Forward pass
    def forward(self, x):
        identity = x
        if self.downsample is not None:   # dashed-line residual structure
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += identity
        out = self.relu(out)

        return out

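Meaning of the expansion parameter: whether the number of kernels changes along the main branch of the residual structure.

In the shallow networks, the first and second layers of each residual structure use the same number of kernels, so expansion = 1.
In the deep networks, the three convolutions on the main branch use different numbers of kernels: the first and second layers are the same and the third has four times as many, so expansion = 4.

Meaning of the downsample parameter: the downsampling module, None by default. It corresponds to the 1×1 convolution on the shortcut of the dashed-line residual structure.

In the series of residual structures that make up conv3_x, conv4_x, and conv5_x, the first block is always a dashed-line residual structure, because the first block of each stage has to downsample the feature map.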

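4.2 Defining the ResNet network framework

In the shallow 18- and 34-layer networks, the first block of conv2_x is a solid-line residual structure. In the deep 50/101/152-layer networks, the first block of conv2_x is a dashed-line residual structure (it changes only the depth, not the height and width, so it uses a 1×1 convolution with stride=1).
In both shallow and deep networks, the first block of conv3_x, conv4_x, and conv5_x is a dashed-line residual structure (it changes the depth and also halves the height and width, so it uses a 1×1 convolution with stride=2).
In both shallow and deep networks, every block of conv2_x to conv5_x from the second one onward is a solid-line residual structure.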
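
These three rules all follow from a single test: the first block of a stage needs a dashed-line shortcut whenever the stride is not 1 or the incoming channel count differs from the stage's output channel count. The sketch below mirrors that test in plain Python without building any layers. The function `plan_stages` and its tuple layout are illustrative assumptions, not part of the implementation in this article.

```python
from typing import List, Tuple


def plan_stages(expansion: int) -> List[Tuple[str, int, int, bool]]:
    """Return (stage name, input channels, output channels, first block is dashed)."""
    in_channel = 64  # channels after the stem (7x7 convolution and max pooling)
    stages = [("conv2_x", 64, 1), ("conv3_x", 128, 2),
              ("conv4_x", 256, 2), ("conv5_x", 512, 2)]
    plan = []
    for name, channel, stride in stages:
        out_channel = channel * expansion
        dashed = stride != 1 or in_channel != out_channel
        plan.append((name, in_channel, out_channel, dashed))
        in_channel = out_channel  # later blocks and the next stage see this depth
    return plan


shallow_plan = plan_stages(expansion=1)  # BasicBlock networks (ResNet-18/34)
deep_plan = plan_stages(expansion=4)     # Bottleneck networks (ResNet-50/101/152)
```

For expansion 1 only conv2_x starts with a solid-line block; for expansion 4 every stage starts with a dashed-line block, and conv2_x needs it purely for the change from 64 to 256 channels.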

# Overall framework of the ResNet network
class ResNet(nn.Module):

    def __init__(self,
                 block,   # residual structure: BasicBlock or Bottleneck
                 blocks_num,   # list with the number of residual blocks per stage, e.g. [3, 4, 6, 3] for ResNet-34
                 num_classes=1000,   # number of classes in the training set
                 include_top=True):   # set to False to build more complex networks on top of ResNet; True by default
        super(ResNet, self).__init__()
        self.include_top = include_top   # store as an instance attribute

        self.in_channel = 64   # depth of the feature map after max pooling

        self.conv1 = nn.Conv2d(3, self.in_channel, kernel_size=7, stride=2,
                               padding=3, bias=False)   # input depth is 3 (RGB image); height and width are halved
        self.bn1 = nn.BatchNorm2d(self.in_channel)
        self.relu = nn.ReLU(inplace=True)

        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)   # height and width are halved again

        self.layer1 = self._make_layer(block, 64, blocks_num[0])   # conv2_x
        self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)   # conv3_x
        self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)   # conv4_x
        self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)   # conv5_x

        if self.include_top:   # True by default
            # Whatever the input height and width, adaptive average pooling outputs 1x1
            self.avgpool = nn.AdaptiveAvgPool2d((1, 1))  # output size = (1, 1)
            self.fc = nn.Linear(512 * block.expansion, num_classes)   # num_classes is the number of classes

        for m in self.modules():   # initialize the convolutional layers
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

    def _make_layer(self, block, channel, block_num, stride=1):   # stride defaults to 1
        # block: BasicBlock or Bottleneck
        # channel: number of kernels in the first convolution of the residual structure
        # block_num: number of residual blocks in this stage
        downsample = None

        # Left condition: the output height and width shrink relative to the input.
        # Right condition: the input and output channel counts differ.
        # Either one prevents x and identity from being added directly.
        if stride != 1 or self.in_channel != channel * block.expansion:  # ResNet-18/34 skips this branch for layer1
            # For ResNet-50/101/152:
            # the first block of conv2_x is also dashed-line, but only the depth changes
            # the first block of conv3/4/5_x changes the depth and halves height and width
            downsample = nn.Sequential(       # downsampling shortcut
                nn.Conv2d(self.in_channel, channel * block.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(channel * block.expansion))   # for layer1 of deep networks: depth x4, height and width unchanged

        layers = []
        layers.append(block(self.in_channel,  # depth of the input feature map (64 for layer1)
                            channel,  # number of kernels in the first convolution on the main branch
                            downsample=downsample,
                            stride=stride))
        self.in_channel = channel * block.expansion

        for _ in range(1, block_num):   # from the second block on, all are solid-line residual structures
            layers.append(block(self.in_channel,  # for layer1: still 64 in shallow networks, already 64*4=256 in deep ones
                                channel))  # number of kernels in the first convolution on the main branch

        # Unpack the list so that the blocks are passed to nn.Sequential as positional arguments
        return nn.Sequential(*layers)

    # Forward pass
    def forward(self, x):
        x = self.conv1(x)   # 7x7 convolution
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)    # 3x3 max pool

        x = self.layer1(x)   # residual structures of conv2_x
        x = self.layer2(x)   # residual structures of conv3_x
        x = self.layer3(x)   # residual structures of conv4_x
        x = self.layer4(x)   # residual structures of conv5_x

        if self.include_top:
            x = self.avgpool(x)    # global average pooling
            x = torch.flatten(x, 1)
            x = self.fc(x)

        return x

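Meaning of the blocks_num parameter: a list giving the number of residual blocks in each stage (conv2_x to conv5_x).

Meaning of the channel parameter: the number of kernels in the first convolution of the residual structure.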
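
To see how the stem and the four stages shrink the feature map, the following sketch traces only the spatial size, without building any layers. It assumes a 224×224 input (the usual ImageNet crop, which the text above does not state) and uses the kernel sizes, strides, and paddings from the code above; the helper names are illustrative.

```python
def out_size(size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Output size of a convolution or pooling layer with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def trace_spatial_sizes(input_size: int = 224) -> dict:
    sizes = {}
    size = out_size(input_size, 7, 2, 3)  # stem: 7x7 convolution, stride 2, padding 3
    sizes["conv1"] = size
    size = out_size(size, 3, 2, 1)        # 3x3 max pool, stride 2, padding 1
    sizes["maxpool"] = size
    stages = [("conv2_x", 1), ("conv3_x", 2), ("conv4_x", 2), ("conv5_x", 2)]
    for name, stride in stages:
        # Only the first block of a stage strides; its 3x3 convolution uses padding 1.
        size = out_size(size, 3, stride, 1)
        sizes[name] = size
    return sizes


sizes = trace_spatial_sizes(224)
```

Under these assumptions the map goes 112, 56, 56, 28, 14, 7, after which adaptive average pooling reduces it to 1×1.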

4.3 Defining ResNet models of different depths

def resnet34(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet34-333f7ec4.pth
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)


def resnet50(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet50-19c8e357.pth
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)


def resnet101(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet101-5d3b4d8f.pth
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes=num_classes, include_top=include_top)
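
The numbers in the model names can be recovered from the block lists. The sketch below uses the usual counting convention, which is an assumption not spelled out above: one stem convolution, the convolutions on the main branch of every residual block (2 per BasicBlock, 3 per Bottleneck), and the final fully connected layer. The 1×1 shortcut convolutions are not counted. The function name is illustrative.

```python
def count_weighted_layers(blocks_num, convs_per_block):
    """Stem conv + main-branch convs of all residual blocks + final fully connected layer."""
    return 1 + sum(blocks_num) * convs_per_block + 1


depths = {
    "resnet34": count_weighted_layers([3, 4, 6, 3], convs_per_block=2),
    "resnet50": count_weighted_layers([3, 4, 6, 3], convs_per_block=3),
    "resnet101": count_weighted_layers([3, 4, 23, 3], convs_per_block=3),
}
```

ResNet-34 and ResNet-50 share the block list `[3, 4, 6, 3]`; the difference in depth comes entirely from the block type.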

4.4 The Complete Network

import torch
from torch import Tensor
import torch.nn as nn
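# torchvision's internal relative import does not work outside the package;
# torch.hub provides the same function for standalone use.
from torch.hub import load_state_dict_from_url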
from typing import Type, Any, Callable, Union, List, Optional


__all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',
           'resnet152', 'resnext50_32x4d', 'resnext101_32x8d',
           'wide_resnet50_2', 'wide_resnet101_2']


model_urls = {
    'resnet18': 'https://download.pytorch.org/models/resnet18-f37072fd.pth',
    'resnet34': 'https://download.pytorch.org/models/resnet34-b627a593.pth',
    'resnet50': 'https://download.pytorch.org/models/resnet50-0676ba61.pth',
    'resnet101': 'https://download.pytorch.org/models/resnet101-63fe2227.pth',
    'resnet152': 'https://download.pytorch.org/models/resnet152-394f9c45.pth',
    'resnext50_32x4d': 'https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth',
    'resnext101_32x8d': 'https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth',
    'wide_resnet50_2': 'https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth',
    'wide_resnet101_2': 'https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth',
}


def conv3x3(in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1) -> nn.Conv2d:
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


class BasicBlock(nn.Module):
    expansion: int = 1

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion: int = 4

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class ResNet(nn.Module):

    def __init__(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        layers: List[int],
        num_classes: int = 1000,
        zero_init_residual: bool = False,
        groups: int = 1,
        width_per_group: int = 64,
        replace_stride_with_dilation: Optional[List[bool]] = None,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)  # type: ignore[arg-type]
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)  # type: ignore[arg-type]

    def _make_layer(self, block: Type[Union[BasicBlock, Bottleneck]], planes: int, blocks: int,
                    stride: int = 1, dilate: bool = False) -> nn.Sequential:
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def _resnet(
    arch: str,
    block: Type[Union[BasicBlock, Bottleneck]],
    layers: List[int],
    pretrained: bool,
    progress: bool,
    **kwargs: Any
) -> ResNet:
    model = ResNet(block, layers, **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls[arch],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model


def resnet18(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNet-18 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress,
                   **kwargs)


def resnet34(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNet-34 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet34', BasicBlock, [3, 4, 6, 3], pretrained, progress,
                   **kwargs)


def resnet50(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNet-50 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained, progress,
                   **kwargs)


def resnet101(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNet-101 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet101', Bottleneck, [3, 4, 23, 3], pretrained, progress,
                   **kwargs)


def resnet152(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNet-152 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet152', Bottleneck, [3, 8, 36, 3], pretrained, progress,
                   **kwargs)


def resnext50_32x4d(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNeXt-50 32x4d model from
    `"Aggregated Residual Transformation for Deep Neural Networks" <https://arxiv.org/pdf/1611.05431.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 4
    return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3],
                   pretrained, progress, **kwargs)


def resnext101_32x8d(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNeXt-101 32x8d model from
    `"Aggregated Residual Transformation for Deep Neural Networks" <https://arxiv.org/pdf/1611.05431.pdf>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 8
    return _resnet('resnext101_32x8d', Bottleneck, [3, 4, 23, 3],
                   pretrained, progress, **kwargs)


def wide_resnet50_2(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""Wide ResNet-50-2 model from
    `"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.

    The model is the same as ResNet except for the bottleneck number of channels
    which is twice larger in every block. The number of channels in outer 1x1
    convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048
    channels, and in Wide ResNet-50-2 has 2048-1024-2048.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['width_per_group'] = 64 * 2
    return _resnet('wide_resnet50_2', Bottleneck, [3, 4, 6, 3],
                   pretrained, progress, **kwargs)


def wide_resnet101_2(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""Wide ResNet-101-2 model from
    `"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.

    The model is the same as ResNet except for the bottleneck number of channels
    which is twice larger in every block. The number of channels in outer 1x1
    convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048
    channels, and in Wide ResNet-50-2 has 2048-1024-2048.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['width_per_group'] = 64 * 2
    return _resnet('wide_resnet101_2', Bottleneck, [3, 4, 23, 3],
                   pretrained, progress, **kwargs)