[Placeholder] Equalized learning rate implementation

CarolSu2055

Last modified: 2024-09-04 20:30:28


Tags: deep learning, pytorch, machine learning

First published: 2022-01-15 17:45:40

Article link: https://blog.csdn.net/qq_38712865/article/details/122502973



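Preface

I have been reading the GPEN code recently. Its Generator borrows from StyleGAN2 and includes quite a few tricks. This post mainly opens a placeholder on "Equalized Learning Rate" (ELR), to be filled in later.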

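ELR background

ELR originates from PG-GAN and is carried over in the StyleGAN series. Its purpose is to stabilize training.

As for the concrete implementation (it is applied in linear layers and conv2d layers), in short:

1) When initializing the layer weights, no fancy initialization scheme is used; the weights are simply drawn at random from N(0, 1).
2) ~~During training, the layer weights are normalized. The normalization coefficient c is obtained from the fan_in computed as in Kaiming initialization, and its exact value varies with the model.~~ During the layer's forward pass, the layer weights are scaled by a coefficient c (i.e. self.scale in implementation 1; see "per-layer normalization constant from He's initializer" in Section 4.1 of the PG-GAN paper).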

Motivation for ELR

To quote the PG-GAN paper again:

The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).

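In other words, ELR affects how optimizers such as Adam and RMSProp update the parameters. What Adam and RMSProp have in common is that the step size takes into account the second-moment estimate of the gradient (i.e. the uncentered variance of the gradient). Dividing by the square root of this estimate roughly brings the step of every parameter to the same scale, regardless of how large its gradient is. If the value ranges of different parameters (the "dynamic range" in the quote above) differ a lot, parameters with a large range need many more update steps to adjust, while for parameters with a small range the current step may be too large. ELR normalizes the parameters themselves so that all of them have a similar value range. When Adam or RMSProp is then used, every parameter receives a similar update step relative to its range, so the learning speed is the same for all of them.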
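
The scale-invariance argument above can be checked with a few lines of plain Python. This is only a sketch of the first Adam step for a single scalar parameter; the helper name `adam_first_step` and the gradient and weight magnitudes are made up for illustration and do not come from PG-GAN or GPEN.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a scalar gradient (hypothetical helper)."""
    m = (1 - beta1) * grad           # first-moment estimate after one step
    v = (1 - beta2) * grad * grad    # second-moment (uncentered) estimate
    m_hat = m / (1 - beta1)          # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Two hypothetical parameters whose gradients differ by a factor of 1000.
step_small_grad = adam_first_step(grad=0.01)
step_large_grad = adam_first_step(grad=10.0)

# The absolute step is about lr in both cases: the gradient scale cancels out.
print(step_small_grad, step_large_grad)

# Relative to the typical magnitude of the weight, the same step is large for a
# narrow-range parameter and small for a wide-range one.
relative_narrow = step_small_grad / 0.01   # weight magnitude around 0.01
relative_wide = step_large_grad / 1.0      # weight magnitude around 1.0
print(relative_narrow, relative_wide)
```

With these numbers the narrow-range weight moves by roughly 10% of its magnitude per step and the wide-range weight by roughly 0.1%, which is the "too large and too small at the same time" situation the quote describes.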

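How ELR works

To understand how ELR works, first review Kaiming He's weight initialization method:

Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming (very accessible)
Derivation of Kaiming initialization (once the idea is clear, work through the math for a deeper understanding)

That is, with Kaiming He initialization:

In the forward pass, the variance of each layer's convolution output stays at 1 (given unit-variance input; this is the fan_in mode).
In the backward pass, the variance of the gradient that each layer passes on to the previous layer stays at 1 (this is the fan_out mode). Each layer computes two gradients: one to update its own weights, and one that is propagated further to compute the gradients of earlier layers.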
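
The forward-pass condition and its link to ELR can be written out as a short derivation. This is a sketch under the usual assumptions of the He initialization argument (zero-mean i.i.d. weights independent of the inputs, ReLU activations, zero bias, pre-activations symmetric around zero). The symbols n_l (fan_in of layer l), c_l and \hat{w}_l are notation introduced here, not taken from the original post.

```latex
\begin{align}
  % layer l: y_l = \sum_{i=1}^{n_l} w_{l,i}\, x_{l,i}, \quad x_l = \max(0, y_{l-1})
  \operatorname{Var}[y_l]
    &= n_l \operatorname{Var}[w_l]\, \mathbb{E}[x_l^2]
     = \tfrac{1}{2}\, n_l \operatorname{Var}[w_l] \operatorname{Var}[y_{l-1}] \\
  % the variance is preserved from layer to layer when
  \tfrac{1}{2}\, n_l \operatorname{Var}[w_l] &= 1
    \quad\Longleftrightarrow\quad
    \operatorname{Std}[w_l] = \sqrt{2 / n_l} \\
  % ELR keeps the same effective weight but moves the constant out of the parameter
  w_l &= c_l\, \hat{w}_l, \qquad
    \hat{w}_l \sim \mathcal{N}(0, 1), \qquad
    c_l = \sqrt{2 / n_l} \\
  % the optimizer only sees \hat{w}_l, whose gradient carries the same constant
  \frac{\partial L}{\partial \hat{w}_l} &= c_l\, \frac{\partial L}{\partial w_l}
\end{align}
```

Since the stored parameter \hat{w}_l has unit scale in every layer, an adaptive optimizer that takes steps of roughly the same absolute size moves all layers at the same relative speed. Note that the code in implementation 1 uses 1 / sqrt(in_dim) for self.scale (gain 1), while implementation 2 uses sqrt(2 / fan_in).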

ELR code implementation

Implementation 1:

Following the GPEN code, define custom EqualConv2d and EqualLinear layers.

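import math

import torch
from torch import nn
from torch.nn import functional as F

# fused_leaky_relu is a custom op shipped with the GPEN repository, not part of PyTorch.


class EqualLinear(nn.Module):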
    def __init__(
        self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1, activation=None, device='cpu'
    ):
        super().__init__()

        self.weight = nn.Parameter(torch.randn(out_dim, in_dim).div_(lr_mul))

        if bias:
            self.bias = nn.Parameter(torch.zeros(out_dim).fill_(bias_init))

        else:
            self.bias = None

        self.activation = activation
        self.device = device

        self.scale = (1 / math.sqrt(in_dim)) * lr_mul
        self.lr_mul = lr_mul

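    def forward(self, input):
        # ELR: rescale the N(0, 1) weight on every forward pass.
        weight = self.weight * self.scale

        if self.activation:
            # This branch assumes bias=True, as in the original code.
            out = F.linear(input, weight)
            out = fused_leaky_relu(out, self.bias * self.lr_mul, device=self.device)

        else:
            # With bias=False, self.bias is None and must not be multiplied.
            bias = None if self.bias is None else self.bias * self.lr_mul
            out = F.linear(input, weight, bias=bias)

        return out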
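
The text mentions EqualConv2d but only EqualLinear is shown. A convolutional counterpart could look roughly like the sketch below. It is modeled on the EqualLinear code above and is not copied from GPEN; the argument names and the choice of gain 1 (scale = 1 / sqrt(fan_in), as in EqualLinear) are assumptions.

```python
import math

import torch
from torch import nn
from torch.nn import functional as F


class EqualConv2d(nn.Module):
    """Conv2d whose N(0, 1) weight is rescaled on every forward pass (sketch)."""

    def __init__(self, in_channel, out_channel, kernel_size, stride=1, padding=0, bias=True):
        super().__init__()
        # Plain N(0, 1) initialization, no He/Xavier scaling at init time.
        self.weight = nn.Parameter(
            torch.randn(out_channel, in_channel, kernel_size, kernel_size)
        )
        # fan_in = in_channels * kernel area; gain 1 mirrors EqualLinear.
        self.scale = 1 / math.sqrt(in_channel * kernel_size ** 2)
        self.stride = stride
        self.padding = padding
        self.bias = nn.Parameter(torch.zeros(out_channel)) if bias else None

    def forward(self, input):
        # The scaling constant is applied at run time, not baked into the parameter.
        return F.conv2d(
            input,
            self.weight * self.scale,
            bias=self.bias,
            stride=self.stride,
            padding=self.padding,
        )


conv = EqualConv2d(in_channel=64, out_channel=32, kernel_size=3, padding=1)
x = torch.randn(8, 64, 16, 16)
y = conv(x)
print(y.shape)  # torch.Size([8, 32, 16, 16])
print(y.std())
```

For unit-variance input the output standard deviation comes out at roughly 1 (slightly lower because of zero padding at the borders), even though the stored weights themselves have standard deviation 1.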

Implementation 2:

Following style-based-gan-pytorch, define a custom class EqualLR that can be applied to nn.Linear and nn.Conv2d.

import math

from torch import nn


class EqualLR:
    def __init__(self, name):
        self.name = name

    def compute_weight(self, module):
        weight = getattr(module, self.name + '_orig')
        # fan_in = in_channels * kernel area (the kernel area is 1 for nn.Linear)
        fan_in = weight.data.size(1) * weight.data[0][0].numel()

        # He constant sqrt(2 / fan_in), applied at run time
        return weight * math.sqrt(2 / fan_in)

    @staticmethod
    def apply(module, name):
        fn = EqualLR(name)

        weight = getattr(module, name)    #1
        del module._parameters[name]
        module.register_parameter(name + '_orig', nn.Parameter(weight.data))
        module.register_forward_pre_hook(fn)    #2

        return fn

    def __call__(self, module, input):    #3
        weight = self.compute_weight(module)    #4
        setattr(module, self.name, weight)    #5
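
To show how the hook is meant to be used, here is a small self-contained sketch. It repeats a compact copy of EqualLR so that it runs on its own; the wrapper function `equal_lr` and the layer sizes are illustrative assumptions rather than code quoted from the post.

```python
import math

import torch
from torch import nn


class EqualLR:
    """Compact copy of the hook class above."""

    def __init__(self, name):
        self.name = name

    def compute_weight(self, module):
        weight = getattr(module, self.name + '_orig')
        fan_in = weight.data.size(1) * weight.data[0][0].numel()
        return weight * math.sqrt(2 / fan_in)

    @staticmethod
    def apply(module, name):
        fn = EqualLR(name)
        weight = getattr(module, name)
        # Replace the trainable `weight` with `weight_orig`.
        del module._parameters[name]
        module.register_parameter(name + '_orig', nn.Parameter(weight.data))
        # Recompute the scaled weight before every forward call.
        module.register_forward_pre_hook(fn)
        return fn

    def __call__(self, module, input):
        setattr(module, self.name, self.compute_weight(module))


def equal_lr(module, name='weight'):
    """Attach the EqualLR hook to `module` and return it (assumed helper)."""
    EqualLR.apply(module, name)
    return module


linear = nn.Linear(8, 4)
linear.weight.data.normal_()  # N(0, 1) initialization, as ELR prescribes
linear.bias.data.zero_()
linear = equal_lr(linear)

out = linear(torch.randn(2, 8))
param_names = sorted(name for name, _ in linear.named_parameters())
print(param_names)  # ['bias', 'weight_orig']
```

After wrapping, the optimizer only sees weight_orig (unit scale), while the forward pass uses the rescaled tensor that the pre-hook stores as a plain attribute named weight.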