Normalization Layers

1. BatchNorm

Batch Normalization was proposed to address the internal covariate shift problem in deep neural networks. In a deep network, each layer's input is the previous layer's output, and each layer's parameters are learned during training. As the parameters are updated, the input distribution of every layer keeps changing, so each layer has to continually re-adapt to a new input distribution; this phenomenon is called internal covariate shift.

1.1 Mathematical Formulation

BatchNorm computes the mean and variance over all dimensions except the channel dimension. Taking image data of shape (B, C, H, W) as an example:

$$BN(X) = \gamma \frac{X - \mathbb{E}_{B,H,W}[X]}{\sqrt{\mathrm{Var}_{B,H,W}[X] + \epsilon}} + \beta$$

where

$$\mathbb{E}_{B,H,W}[X] = \texttt{X.view(B, C, H*W).mean(dim=[0, -1])}$$

$$\mathrm{Var}_{B,H,W}[X] = \mathbb{E}_{B,H,W}[X^2] - \mathbb{E}_{B,H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.
The running mean and variance are tracked with an exponential moving average (EMA): during training, normalization uses the mean and variance of the current batch, while at inference time it uses the EMA-updated running statistics, as the update rule below shows.
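
With momentum $\alpha$ (the momentum argument in the implementation below), each training step updates the running statistics as

$$\hat{\mu} \leftarrow (1 - \alpha)\,\hat{\mu} + \alpha\,\mathbb{E}_{B,H,W}[X]$$

$$\hat{\sigma}^2 \leftarrow (1 - \alpha)\,\hat{\sigma}^2 + \alpha\,\mathrm{Var}_{B,H,W}[X]$$

and $\hat{\mu}$, $\hat{\sigma}^2$ replace the batch statistics at inference time; this mirrors the exp_mean and exp_var updates in the code below.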

1.2 Code Implementation

Taking image data as an example, a Python implementation of BatchNorm looks like this:

import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        self.channels = channels
        self.eps = eps
        self.momentum = momentum

        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

        self.register_buffer('exp_mean', torch.zeros(channels))
        self.register_buffer('exp_var', torch.ones(channels))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images, shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """
        batch_size, channels, height, width = x.shape
        assert self.channels == channels
        x = x.view(batch_size, channels, -1)  # shape: (batch_size, channels, height * width)

        if self.training:  # train mode
            mean = x.mean(dim=[0, 2])  # shape: (channels,)
            mean_x_square = (x ** 2).mean(dim=[0, 2])  # shape: (channels,)
            var = mean_x_square - mean ** 2  # shape: (channels,)
            # detach so the running buffers do not keep the autograd graph alive
            self.exp_mean = (1 - self.momentum) * self.exp_mean + self.momentum * mean.detach()  # shape: (channels,)
            self.exp_var = (1 - self.momentum) * self.exp_var + self.momentum * var.detach()  # shape: (channels,)
        else:  # eval mode
            mean = self.exp_mean  # shape: (channels,)
            var = self.exp_var  # shape: (channels,)

        x_norm = (x - mean.view(1, -1, 1)) / torch.sqrt(var + self.eps).view(1, -1, 1)  # shape:  (batch_size, channels, height * width)
        x_norm = self.gamma.view(1, -1, 1) * x_norm + self.beta.view(1, -1, 1)  # shape:  (batch_size, channels, height * width)

        return x_norm.view(batch_size, channels, height, width)
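
As a quick sanity check (a sketch, not part of the original post), the forward output in training mode can be compared against PyTorch's built-in nn.BatchNorm2d, which also normalizes with the biased variance of the current batch; note that PyTorch updates its running_var with the unbiased variance, so only the outputs, not the running buffers, are expected to agree.

x = torch.randn(4, 3, 8, 8)
custom_bn = BatchNorm(3)                              # the implementation above
torch_bn = nn.BatchNorm2d(3, eps=1e-5, momentum=0.1)  # PyTorch reference
# both modules are in training mode by default, so both use the batch statistics
print(torch.allclose(custom_bn(x), torch_bn(x), atol=1e-5))  # expected: True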

2. LayerNorm

Layer Normalization targets what can be called the external covariate shift problem in deep neural networks. A layer's input depends not only on the previous layer's output but also on the distribution of the whole dataset, which is essentially random. The input distribution of each layer therefore changes as the data changes, forcing each layer to keep re-adapting to a new input distribution; this phenomenon is referred to as external covariate shift. Because LayerNorm computes its statistics per sample rather than per batch, it does not depend on the batch dimension at all.

2.1 Mathematical Formulation

LayerNorm computes the mean and variance over all dimensions except the batch dimension. Taking image data of shape (B, C, H, W) as an example:

(Figure: LayerNorm schematic)
$$LN(X) = \gamma \frac{X - \mathbb{E}_{C,H,W}[X]}{\sqrt{\mathrm{Var}_{C,H,W}[X] + \epsilon}} + \beta$$

where:

$$\mathbb{E}_{C,H,W}[X] = \texttt{X.view(B, C, H, W).mean(dim=[1, 2, 3])}$$

$$\mathrm{Var}_{C,H,W}[X] = \mathbb{E}_{C,H,W}[X^2] - \mathbb{E}_{C,H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.

2.2 Code Implementation

Taking image data as an example, a Python implementation of LayerNorm looks like this:

import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, channels: int, height: int, width: int, eps: float = 1e-5):
        super().__init__()

        self.channels = channels
        self.height = height
        self.width = width
        self.eps = eps

        self.gamma = nn.Parameter(torch.ones(channels, height, width))
        self.beta = nn.Parameter(torch.zeros(channels, height, width))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images, shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """

        batch_size, channels, height, width = x.shape
        assert self.channels == channels and self.height == height and self.width == width

        mean = x.mean(dim=[1, 2, 3], keepdim=True)  # shape (batch_size, 1, 1, 1)
        mean_x_square = (x ** 2).mean(dim=[1, 2, 3], keepdim=True)  # shape (batch_size, 1, 1, 1)
        var = mean_x_square - mean ** 2
    
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        x_norm = self.gamma * x_norm + self.beta

        return x_norm
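
As a quick check (not part of the original post), this should agree with PyTorch's nn.LayerNorm when normalized_shape spans the (channels, height, width) dimensions, since both initialize the affine parameters to ones and zeros:

x = torch.randn(4, 3, 8, 8)
custom_ln = LayerNorm(3, 8, 8)                # the implementation above
torch_ln = nn.LayerNorm([3, 8, 8], eps=1e-5)  # PyTorch reference
print(torch.allclose(custom_ln(x), torch_ln(x), atol=1e-5))  # expected: True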

3. InstanceNorm

Instance Normalization normalizes each sample individually, which makes models better suited to tasks such as image style transfer. In style transfer, the size and content of input images vary arbitrarily, so BatchNorm and LayerNorm are poorly suited for normalizing the data. InstanceNorm, by contrast, normalizes each sample and each channel on its own and can therefore adapt to inputs of different sizes and content.

3.1 Mathematical Formulation

InstanceNorm computes the mean and variance over all dimensions except the batch and channel dimensions. Taking image data of shape (B, C, H, W) as an example:

$$IN(X) = \gamma \frac{X - \mathbb{E}_{H,W}[X]}{\sqrt{\mathrm{Var}_{H,W}[X] + \epsilon}} + \beta$$

where:

$$\mathbb{E}_{H,W}[X] = \texttt{X.view(B, C, H*W).mean(dim=[-1])}$$

$$\mathrm{Var}_{H,W}[X] = \mathbb{E}_{H,W}[X^2] - \mathbb{E}_{H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.

3.2 Code Implementation

Taking image data as an example, a Python implementation of InstanceNorm looks like this:

import torch
from torch import nn

class InstanceNorm(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()

        self.channels = channels

        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images, shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """
        batch_size, channels, height, width = x.shape
        assert self.channels == channels

        x = x.view(batch_size, self.channels, -1)  # shape: (batch_size, channels, height * width)

        mean = x.mean(dim=[-1], keepdim=True)  # shape: (batch_size, channels, 1)
        mean_x_square = (x ** 2).mean(dim=[-1], keepdim=True)  # shape: (batch_size, channels, 1)
        var = mean_x_square - mean ** 2  # shape: (batch_size, channels, 1)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)  # shape: (batch_size, channels, height * width)
        x_norm = self.gamma.view(1, -1, 1) * x_norm + self.beta.view(1, -1, 1)  # shape: (batch_size, channels, height * width)

        return x_norm.view(batch_size, channels, height, width)
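
As a quick check (not part of the original post), the output should match PyTorch's nn.InstanceNorm2d with affine=True, whose per-channel weight and bias are likewise initialized to ones and zeros:

x = torch.randn(4, 3, 8, 8)
custom_in = InstanceNorm(3)                             # the implementation above
torch_in = nn.InstanceNorm2d(3, eps=1e-5, affine=True)  # PyTorch reference
print(torch.allclose(custom_in(x), torch_in(x), atol=1e-5))  # expected: True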

4. GroupNorm

Group Normalization addresses the poor performance of BatchNorm on small batches. In BatchNorm, every feature channel is normalized with statistics shared by all samples in the mini-batch along that channel, which reduces the impact of internal covariate shift. With small batches, however, the number of samples per mini-batch is too small for the batch mean and variance to be estimated accurately, which hurts model performance. GroupNorm instead splits the channels into groups and computes the statistics per sample within each group, so it is independent of the batch size.

4.1 Mathematical Formulation

GroupNorm splits the C channels into G groups and computes the mean and variance over all dimensions except the batch and group dimensions. Taking image data of shape (B, C, H, W) as an example:

$$GN(X) = \gamma \frac{X - \mathbb{E}_{\frac{C}{G},H,W}[X]}{\sqrt{\mathrm{Var}_{\frac{C}{G},H,W}[X] + \epsilon}} + \beta$$

where

$$\mathbb{E}_{\frac{C}{G},H,W}[X] = \texttt{X.view(B, G, (C // G) * H * W).mean(dim=[-1])}$$

$$\mathrm{Var}_{\frac{C}{G},H,W}[X] = \mathbb{E}_{\frac{C}{G},H,W}[X^2] - \mathbb{E}_{\frac{C}{G},H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.

4.2 Code Implementation

Taking image data as an example, a Python implementation of GroupNorm looks like this:

import torch
from torch import nn

class GroupNorm(nn.Module):
    def __init__(self, groups: int, channels: int, eps: float = 1e-5):
        super().__init__()

        assert channels % groups == 0, "Number of channels should be evenly divisible by the number of groups"
        self.groups = groups
        self.channels = channels

        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images with shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """
        batch_size, channels, height, width = x.shape
        assert self.channels == channels

        x = x.view(batch_size, self.groups, -1)  # shape: (batch_size, groups, channels // groups * height * width)

        mean = x.mean(dim=[-1], keepdim=True)  # shape: (batch_size, groups, 1)
        mean_x_square = (x ** 2).mean(dim=[-1], keepdim=True)  # shape: (batch_size, groups, 1)
        var = mean_x_square - mean ** 2  # shape: (batch_size, groups, 1)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)  # shape: (batch_size, groups, channels // groups * height * width)
        x_norm = x_norm.view(batch_size, self.channels, -1)  # shape: (batch_size, channels, height * width)
        x_norm = self.gamma.view(1, -1, 1) * x_norm + self.beta.view(1, -1, 1)  # shape: (batch_size, channels, height * width)

        return x_norm.view(batch_size, channels, height, width)
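
As a quick check (not part of the original post), the output should match PyTorch's nn.GroupNorm with the same number of groups. Note also that groups=1 makes GroupNorm compute the same statistics as the LayerNorm above (over C, H, W), while groups=channels recovers InstanceNorm.

x = torch.randn(4, 6, 8, 8)
custom_gn = GroupNorm(groups=3, channels=6)                      # the implementation above
torch_gn = nn.GroupNorm(num_groups=3, num_channels=6, eps=1e-5)  # PyTorch reference
print(torch.allclose(custom_gn(x), torch_gn(x), atol=1e-5))  # expected: True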