参考:
[1] LIU Z, MAO H, WU C Y, et al. A ConvNet for the 2020s[C/OL]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA. 2022. http://dx.doi.org/10.1109/cvpr52688.2022.01167. DOI:10.1109/cvpr52688.2022.01167.
[2] 薛导ConvNext博客
[3] DropPath理解
[4] 官方源码Facebook
TOC
1 模型学习
1.0 训练策略
作者以训练Vit的策略用来训练ResNet50,发现比原来要好,以此为基准。
1.1 Macro design
改变stage的计算比例
- 原ResNet50的stage的重复次数为[3,4,6,3],而Swin-T的重复次数比例为1:1:3:1, Swin-L的重复次数比例为1:1:9:1,可见第三层stage的重复次数更多。
- ConvNext将由原来的[3,4,6,4] 魔改为 [3,3,9,3]
改变stem(既开始的部分)
- 原ResNet的stem为卷积7x7,stride为2,padding为3,再经过stride为2的Maxpooling层,将原图像的高宽缩小4倍(224->56)。但在Swim-T中,卷积之间是没有重叠的,既stride=kernel_size
- ConvNext仿照Swin-Transformer,用4x4Conv替代原来的7x7Conv,并将步长设置为4,不使用padding,这样直接将下采样四倍了。
1.2 ResNeXt-ify
替换GroupConv为DW Conv
- 将组卷积替换为depthwise卷积(DW卷积),也就是将groups设置为channel,DW卷积最早出现于MobileNet中,也是GroupConv的一种特殊形式(groups = input channels)
- 然后使用1x1卷积去改变channel数(官方代码中说是使用pointwise 1x1 conv,并且就是一个
nn.Linear
)
改变conv的深度
- 原ResNet论文中,卷积的深度为(64,128,256,512), ConvNext将其改为(96,192,384,768)
1.3 Inverted bottleneck
- 原来ResNet结构的Block都是呈现宽-窄-宽结构,在ConvNext中,变成窄-宽-窄结构,如图的(a)->(b)
1.4 Large kernel size
将DW Conv移到第一层
- 为了确保efficiency,large-kernel conv通常有较少的channels,而1x1 conv反而会去做繁杂的事情(比如升维、降维)。故ConvNext将DW Conv移动到第一层,并保持维度不变,而第二、三层的1x1 Conv负责升维降维。
增大kernel size
- ConvNext将DW Conv的3x3 Conv变成7x7 Conv
1.5 Micro design
- 将ReLU换成GeLU,GELU可以看作是RELU的smooth版本
- 减小激活函数的使用,由Swim可知,只在最后一个1x1conv之前(降维之前)使用激活函数
- 在第二个1x1 Conv之前使用BN层,减小BN层的使用
- 将BN层换成LN层,BN模块有很多复杂的有害的影响
- 用2x2 Conv with stride 2替换原来的 1x1 Conv with stride 2进行残差下采样,也就是不使用残差下采样了,而是用 identity残差连接+2x2Conv下采样 替代
1.6 框架图
2 ConvNext代码实现
2.1 DropPath
由 DropPath理解可知,droppath是随机失活样本中的一部分,而dropout是随机失活样本的一些权重。需要注意的是,Droppath只在training phase中使用,并且放在block的残差连接之前。
下图分别为DropPath和Dropout的输出。
具体实现
def droppath(x,drop_prob:float=0.,training:bool=False):
if drop_prob == 0. or not training:
return x
keep_prob = 1. - drop_prob
shape = (x.shape[0],) + (1,)*(x.dim -1)
# 举个例子:
# 如果x为[10,3,224,224],那么shape为[10,1,1,1],只保留第0维,扩展后3维
random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
# random_tensor介于[0,2)之间
random_tensor = random_tensor.floor_()
# random_tensor为0或1,表示失活或保留
x = x.div(keep_prob)*randoom_tensor
# 保持期望值不变,参考网址:https://www.cnblogs.com/dan-baishucaizi/p/14703263.html
官方源码调用库函数
from timm.models.layers import DropPath
...
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
2.2 GELU
参考:https://alaaalatif.github.io/2019-04-11-gelu/
总的来说,GELU要比RELU的训练效果要好,并且GELU具有导数连续,比RELU更加平滑等优点,并且在0以下的值有一定概率不会变为0,避免了梯度消失。由下图可知,GELU安插在1x1conv升维之前。
代码:
self.gelu = nn.GELU()
2.3 LayerNorm
LayerNorm就是对每一个channel独立地进行求均值和方差的操作,然后再去对每一个channel独立地进行标准化。现在分channel维在第二位和第四位的情况,而pytorch官方可直接调用LayerNorm的情况是channel维在第四位的情况。
代码实现
class LayerNorm(nn.Module):
def __init__(self,normalized_shape,eps=1e-6,data_format='channels_last')
self.eps = eps
self.weight = nn.Parameter(torch.ones(normalized_shape)) # learnable gamma
self.bias = nn.Parameter(torch.zeros(normalized_shape)) # learnable beta
self.data_format = data_format
if self.data_format not in ['channels_first','channels_last']:
raise NotImplementedError
self.normalized_shape = (normalized_shape,)
def forward(self,x):
if self.data_format == 'channels_first':
return F.layer_norm(x,self.normalized_shape,self.weight,self.bias,self.eps)
elif self.data_format == 'channels_last':
u = x.mean(1,keepdim=True)
s = (x-u).pow(2).mean(1,keepdim=True) # variance
x = (x-u)/torch.sqrt(s+self.eps)
x = self.weight[:,None,None]*x+self.bias[:,None,None]
return x
2.4 Block的实现
Block的源码中有一个layer_scale的操作是原论文中没有提及的,源于改论文Going deeper with image transformers. ICCV, 2021
, 简单来说就是每一个通道的值进行缩放,缩放的因子gamma是一个可学习参数。
官方源码中开头的注释:
ConvNeXt Block. There are two equivalent implementations:
(1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
(2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear; Permute back
We use (2) as we find it slightly faster in PyTorch
Args:
dim (int): Number of input channels.
drop_path (float): Stochastic depth rate. Default: 0.0
layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
代码实现:
class Block(nn.Module):
def __init__(self,dim,drop_path:float=0.,layer_scale_init_value:float=1e-6):
self.dwconv = nn.Conv2d(dim,dim,kernel_size=7,padding=3,groups=dim)
self.norm = LayerNorm(dim,eps=1e-6)
self.pwconv1 = nn.Linear(dim,dim*4)
self.gelu = nn.GELU()
self.pwconv2 = nn.Linear(dim*4,dim)
self.gamma = nn.Parameter(layer_scale_init_value* torch.ones((dim)), require_grad=True) if layer_scale_init_value > 0 else None
self.Droppath = DropPath(drop_path) if drop_path > 0. else nn.Identity()
def forward(self,x):
input_ = x
x = self.dwconv(x)
x = x.permute(0,2,3,1) # put channel dim into last dim
x = self.norm(x)
x = self.pwconv1(x)
x = self.gelu(x)
x = self.pwconv2(x)
if self.gamma is not None:
x *= self.gamma
x = x.permute(0,3,1,2)
x = input_ + self.drop_path(x)
return x
2.5 ConvNext网络搭建
2.5.1 特征提取部分
- stem为 [
conv(k=4,s=4) + Layer Norm
] - downsample为 [
Layer Norm + conv(dim1,dim2,k=2,s=2)
] - stage部分为 [ 多个block ]
2.5.2 分类头部分
- 全局平均: 相当于平均每一维,既[B,C,H,W] - > [B,C]
- Layer Norm + Linear不介绍了
2.5.3 初始化
需要初始化的层包括Linear和Conv,分别初始化其weight和bias
- 分类头Linear: weight和bias都乘以1(保持不变),且这是一个in-place操作,意味着它会直接修改self.head.bias的值,而不是创建一个新的tensor。
- 其他Linear和conv:关于weight使用 truncated normal distribution(截断正态分布),关于bias使用常数constant为0
网络代码:
class ConvNext(nn.Module):
def __init__(self, in_chans=3, num_classes=1000, depths=[3,3,9,3], dims=[96,192,384,768],
drop_path_rate=0.,layer_scale_init_value=1e-6,head_init_scale=1.)
super().__init__()
# ------------------ 下采样部分 --------------------------------
self.downsample_layers = nn.ModuleList()
# 保存 stem和3个downsample_layer
stem = nn.Sequential(
nn.Conv2d(3,dims[0],kernel_size=4,stride=4),
LayerNorm(dims[0],eps=1e-6,data_format='channels_first')
)
for i in range(3):
downsample_layer = nn.Sequential(
LayerNorm(dim[i],eps=1e-6,data_format='channels_first'),
nn.Conv2d(dim[i],dim[i+1],kernel_size=2,stride=2)
)
self.downsample_layers.append(downsample_layer)
# ------------------- stage 部分 --------------------------------
self.stages = nn.ModuleList()
dp_rate = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths)]
cur = 0 # 表示当前在第几层(深度)
for i in range(4):
stage = nn.Sequential(
*[Block(dims[i], drop_path=dp_rate[cur+j],
layer_scale_init_value=layer_scale_init_value) for j in range(depths[i])
)
self.stages.append(stage)
cur += depths[i]
# ------------------ 分类头和初始化部分 ----------------------------
self.norm = LayerNorm(dim[-1],eps=1e-6)
self.head = nn.Linear(dim[-1],num_classes)
self.apply(self._init_weights)
self.head.weight.data.mul_(head_init_scale)
self.head.bias.data.mul_(head_init_scale)
def _init_weight(self,m):
if isinstance(m,(nn.Linear, nn.Conv2d)):
trunc_normal(m.weight, std=.02),
nn.init.constant_(m.bias,0)
def forward_features(self,x):
# features extraction part(stem, stage and downsample)
for i in range(4):
x = self.downsample_layers[i](x)
x = self.stages[i](x)
return self.norm(x.mean([-2,-1])) # GAP(global average pooling) [c,b,h,w] -> [c,b]
def forward(self,x):
x = self.forward_features(x)
x = self.head(x)
return x