ResNet
1 ResNet, the residual network, was proposed mainly to address the vanishing/exploding gradients that come with ever deeper networks (nowadays largely handled by BN), and also the degradation problem: as the network deepens, training-set accuracy drops, and this is not caused by overfitting. Its main structure is shown in the figure below:
2 Take ResNet-50 as an example: each block comes in two forms, the basic block and the bottleneck block; the second exists purely to reduce the number of parameters, as in the figure below
3 Going from conv2_x to conv3_x, the first block needs a downsampling branch on the shortcut so that the channel count (and spatial size) of the identity matches the main path before the addition
# conv1x1 / conv3x3 below are the helper functions defined in torchvision's resnet.py
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out
Inside a block, each convolution is followed by bn + relu, except the last convolution: it is followed by bn only, then the input x is added, and only then is relu applied.
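Point 3 above can be sketched with plain nn modules (a minimal example, not the torchvision classes): the first block of conv3_x strides its 3x3 convolution, and a 1x1 stride-2 convolution on the shortcut makes the identity branch match in both channels and spatial size:

```python
import torch
import torch.nn as nn

# Main path of the first conv3_x bottleneck: 1x1 reduce, 3x3 stride-2, 1x1 expand.
main = nn.Sequential(
    nn.Conv2d(256, 128, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, 3, stride=2, padding=1, bias=False), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.Conv2d(128, 512, 1, bias=False), nn.BatchNorm2d(512),
)
# Shortcut: 1x1 stride-2 conv + bn so the identity also becomes 512 x 28 x 28.
downsample = nn.Sequential(
    nn.Conv2d(256, 512, 1, stride=2, bias=False),
    nn.BatchNorm2d(512),
)

x = torch.randn(1, 256, 56, 56)      # a conv2_x output
out = torch.relu(main(x) + downsample(x))
print(out.shape)                     # torch.Size([1, 512, 28, 28])
```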
SENet
4 SENet (Squeeze-and-Excitation Networks) starts from the observation that not every channel is equally useful. Through self-learning (an attention mechanism) it learns a weight for each channel; weighting the channels emphasizes informative ones and suppresses uninformative ones, and the method is generic. For the output channels, first apply global average pooling, so each channel reduces to one scalar and C channels give C numbers; these then pass through FC-ReLU-FC-Sigmoid to produce C scalars between 0 and 1, which serve as channel weights. Each original output channel is then scaled by its corresponding weight (every element of the channel is multiplied by the weight), giving a re-weighted feature; the authors call this feature recalibration. (In the figure below the reduction ratio r is 16)
Suppose we combine ResNet with SE. In each block, after the last convolution + bn, the input x is not added yet; record the result as origin_out and keep a copy. Then origin_out goes through global average pooling, followed by fc + relu + fc + sigmoid, yielding a weight between 0 and 1 for each channel, recorded as out. Compute out * origin_out, then add the input x to obtain the final result. The code is as follows:
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        # Global average pooling; the fixed spatial sizes assume a 224x224 input
        if planes == 64:
            self.globalAvgPool = nn.AvgPool2d(56, stride=1)
        elif planes == 128:
            self.globalAvgPool = nn.AvgPool2d(28, stride=1)
        elif planes == 256:
            self.globalAvgPool = nn.AvgPool2d(14, stride=1)
        elif planes == 512:
            self.globalAvgPool = nn.AvgPool2d(7, stride=1)
        # Squeeze-and-excitation FCs, reduction ratio r = 16
        self.fc1 = nn.Linear(in_features=planes * 4, out_features=round(planes / 4))
        self.fc2 = nn.Linear(in_features=round(planes / 4), out_features=planes * 4)
        self.sigmoid = nn.Sigmoid()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        original_out = out
        out = self.globalAvgPool(out)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        out = out.view(out.size(0), out.size(1), 1, 1)
        out = out * original_out
        out += residual
        out = self.relu(out)
        return out
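The hard-coded AvgPool2d sizes above assume a 224×224 input. A minimal standalone SE block (a sketch with our own names, not any library's API) can use AdaptiveAvgPool2d to avoid that:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: learn one weight in (0, 1) per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: C x H x W -> C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                        # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # excitation: reweight each channel

se = SEBlock(256)
x = torch.randn(2, 256, 14, 14)
print(se(x).shape)  # torch.Size([2, 256, 14, 14])
```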
ResNeXt
Chinese reference: https://zhuanlan.zhihu.com/p/32913695; corresponding official PyTorch code: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
Why does it work?
See https://www.zhihu.com/question/323424817/answer/1078704765
1 The structure also follows the split-transform-merge paradigm: the channels are split and then merged, with the basic structure shown in the figure. It amounts to having 32 groups, each with 4 channels, so the original 64 channels become 128; but since grouped convolution is used, the parameter count barely grows, while the larger channel count improves accuracy (this is also the approach taken by MobileNet)
2 ResNeXt is in effect grouped convolution. Different groups live in different subspaces, and they do learn more diverse representations; in addition, grouping may act as a regularizer: since no information is exchanged between groups, the learned relationships are sparser and the risk of overfitting drops
For this structure the three forms below are fully equivalent, so the final implementation uses grouped convolution, i.e. form (c) in the figure
3 Code example: one bottleneck from the official resnext50_32x4d
# resnext50_32x4d
(layer2): Sequential(
  (0): Bottleneck(
    (conv1): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=32, bias=False)
    (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (downsample): Sequential(
      (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
# resnext101_32x8d
(layer2): Sequential(
  (0): Bottleneck(
    (conv1): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=32, bias=False)
    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (downsample): Sequential(
      (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
# resnet101
(layer2): Sequential(
  (0): Bottleneck(
    (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (downsample): Sequential(
      (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
Comparing with the ResNet structure, the difference is the width of the block: the 32x4d model has 2× the feature channels and the 32x8d model 4×, while the leading 32 is the group count of the second convolution
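The parameter arithmetic can be checked directly (a small sketch; the layer shapes come from the dumps above):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# conv2 of resnet101 layer2: plain 3x3, 128 -> 128
plain = nn.Conv2d(128, 128, 3, stride=2, padding=1, bias=False)
# conv2 of resnext101_32x8d layer2: grouped 3x3, 512 -> 512, 32 groups
grouped = nn.Conv2d(512, 512, 3, stride=2, padding=1, groups=32, bias=False)

print(n_params(plain))    # 128 * 128 * 9 = 147456
print(n_params(grouped))  # (512/32) * 512 * 9 = 73728
```

So the grouped layer is 4× wider yet has half the parameters of the plain one.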
Grouped convolution
By definition, the input feature map is split into groups, each group is convolved separately, and no information is exchanged between groups. Suppose the input feature map is C×H×W, the number of output channels is N, and we split into G groups. Then each kernel effectively has size (C/G)×K×K, and each group produces N/G output channels; concatenating the group results gives the final N channels. The effects:
1 The main one is fewer parameters: the parameter count is (C/G)×K×K×N, i.e. 1/G of the C×K×K×N of an ungrouped convolution
2 It acts as a form of implicit regularization, reducing the risk of overfitting
3 If G = C = N, this is depthwise convolution
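Point 3 can be verified with PyTorch's groups argument (a minimal sketch):

```python
import torch.nn as nn

C, N, K, G = 64, 64, 3, 64     # G = C = N: depthwise convolution
dw = nn.Conv2d(C, N, K, padding=1, groups=G, bias=False)

# Each of the N filters sees only C/G = 1 input channel.
print(dw.weight.shape)                           # torch.Size([64, 1, 3, 3])
print(sum(p.numel() for p in dw.parameters()))   # (C/G) * K * K * N = 576
```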
ResNeSt
This is the newest of the family, also attention-based. Main write-up: https://zhuanlan.zhihu.com/p/132655457; code: https://github.com/zhanghang1989/ResNeSt; the author also provides detection code based on detectron2: https://github.com/zhanghang1989/detectron2-ResNeSt. The paper mainly borrows from SENet, SKNet and ResNeXt, and comparing the code you will find it very similar to SKNet, so many people feel the academic novelty is limited; but it provides a very strong backbone for industrial use, which is valuable.
The correspondence between ResNeSt, SE-Net and SK-Net is illustrated below:
The main hyper-parameters are Cardinal and Split, which map to groups and radix in the code. In the released code, however, groups=1 and radix=2 everywhere, which is exactly why it ends up so similar to SKNet; there is plenty of discussion, including from the author, at https://github.com/zhanghang1989/ResNeSt/issues/4
The forward flow of one Bottleneck:
(layer2): Sequential(
  (0): Bottleneck(
    (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (avd_layer): AvgPool2d(kernel_size=3, stride=2, padding=1)
    (conv2): SplAtConv2d(
      (conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2, bias=False)
      (bn0): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (fc1): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1))
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (fc2): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1))
      (rsoftmax): rSoftMax()
    )
    (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (downsample): Sequential(
      (0): AvgPool2d(kernel_size=2, stride=2, padding=0)
      (1): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
This is basically the same as the original ResNet Bottleneck; the difference is in conv2, which is replaced by a SplAtConv2d. The flow through the module above: the input is a 128×56×56 feature map. It first passes through a groups=2 convolution that widens it to 256 channels (this is the radix=2 part); the result is split into two 128×56×56 features, which are saved. The two splits are then summed element-wise back to 128×56×56, globally average-pooled, and passed through fc, bn and relu layers to get a 256×1×1 tensor. rsoftmax then turns this into weights, which are split back into two groups of 128, multiplied with the two saved features, and the products are summed to obtain the final attended feature. At its core this is still channel attention.
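The flow above can be sketched as a simplified split-attention module (radix=2, cardinality=1; the class name and the reduction choice are ours, not the exact SplAtConv2d implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        # groups=radix convolution widening channels -> channels * radix
        self.conv = nn.Conv2d(channels, channels * radix, 3, padding=1,
                              groups=radix, bias=False)
        self.bn0 = nn.BatchNorm2d(channels * radix)
        inter = max(channels * radix // reduction, 32)
        self.fc1 = nn.Conv2d(channels, inter, 1)
        self.bn1 = nn.BatchNorm2d(inter)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)

    def forward(self, x):
        b, c = x.size(0), x.size(1)
        out = F.relu(self.bn0(self.conv(x)))       # B x (c*radix) x H x W
        splits = torch.split(out, c, dim=1)        # radix tensors, each B x c
        gap = sum(splits)                          # element-wise sum of the splits
        gap = F.adaptive_avg_pool2d(gap, 1)        # B x c x 1 x 1
        gap = F.relu(self.bn1(self.fc1(gap)))
        atten = self.fc2(gap).view(b, self.radix, c)
        atten = F.softmax(atten, dim=1)            # weights compete across radix
        atten = atten.reshape(b, -1, 1, 1)
        attens = torch.split(atten, c, dim=1)      # back to radix groups of c
        return sum(a * s for a, s in zip(attens, splits))

m = SplitAttention(128)
x = torch.randn(2, 128, 56, 56)
print(m(x).shape)  # torch.Size([2, 128, 56, 56])
```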
There are two details here:
1 Before rsoftmax, the features are transposed
x = x.view(batch, self.cardinality, self.radix, -1).transpose(1, 2)
The author explains this at https://github.com/zhanghang1989/ResNeSt/issues/41: before the fc layers the features are grouped radix-first, while after the fc layers they are grouped cardinality-first, so a transpose is needed before the softmax. Note also that where SENet computes the weights with a sigmoid, here the weights come from a softmax
2 The feature map is halved at each stage, and the two structures handle this a bit differently
# Downsampling in the original ResNet
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
# Downsampling in ResNeSt
(avd_layer): AvgPool2d(kernel_size=3, stride=2, padding=1)
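A quick shape check (a sketch) shows both routes halve the 56×56 map; the avd route just pools before a stride-1 convolution, so no input positions are simply skipped:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)

# Original ResNet: the 3x3 convolution itself strides.
strided = nn.Conv2d(256, 256, 3, stride=2, padding=1, bias=False)
# ResNeSt: average-pool first (avd_layer), then a stride-1 convolution.
avd = nn.Sequential(
    nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(256, 256, 3, stride=1, padding=1, bias=False),
)

print(strided(x).shape)  # torch.Size([1, 256, 28, 28])
print(avd(x).shape)      # torch.Size([1, 256, 28, 28])
```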