Normalization Layers

1. BatchNorm

Batch Normalization was proposed to address the internal covariate shift problem in deep neural networks. In a deep network, each layer's input is the previous layer's output, and each layer's parameters are learned during training. As the parameters are updated, the input distribution of every layer keeps changing, so each layer has to continually re-adapt to a new input distribution; this phenomenon is called internal covariate shift.

1.1 Mathematical Formulation

BatchNorm computes the mean and variance over all dimensions except the channel dimension. Taking image data of shape (B, C, H, W) as an example:

$$BN(X) = \gamma \frac{X - \mathbb{E}_{B,H,W}[X]}{\sqrt{\mathrm{Var}_{B,H,W}[X] + \epsilon}} + \beta$$

where

$$\mathbb{E}_{B,H,W}[X] = \texttt{X.view(B, C, H*W).mean(dim=[0, -1])}$$

$$\mathrm{Var}_{B,H,W}[X] = \mathbb{E}_{B,H,W}[X^2] - \mathbb{E}_{B,H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.
The running mean and variance are tracked with an exponential moving average (EMA): during training, normalization uses the mean and variance of the current batch, while at inference time it uses the EMA-updated running statistics, as the update rule below shows.
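
With momentum $\alpha$ (the momentum argument in the implementation below), each training step updates the running statistics as

$$\hat{\mu} \leftarrow (1 - \alpha)\,\hat{\mu} + \alpha\,\mathbb{E}_{B,H,W}[X]$$

$$\hat{\sigma}^2 \leftarrow (1 - \alpha)\,\hat{\sigma}^2 + \alpha\,\mathrm{Var}_{B,H,W}[X]$$

and $\hat{\mu}$, $\hat{\sigma}^2$ replace the batch statistics at inference time; this mirrors the exp_mean and exp_var updates in the code below.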

1.2 Code Implementation

Taking image data as an example, a Python implementation of BatchNorm looks like this:

import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        self.channels = channels
        self.eps = eps
        self.momentum = momentum

        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

        self.register_buffer('exp_mean', torch.zeros(channels))
        self.register_buffer('exp_var', torch.ones(channels))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images, shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """
        batch_size, channels, height, width = x.shape
        assert self.channels == channels
        x = x.view(batch_size, channels, -1)  # shape: (batch_size, channels, height * width)

        if self.training:  # train mode
            mean = x.mean(dim=[0, 2])  # shape: (channels,)
            mean_x_square = (x ** 2).mean(dim=[0, 2])  # shape: (channels,)
            var = mean_x_square - mean ** 2  # shape: (channels,)
            # detach so the running buffers do not keep the autograd graph alive
            self.exp_mean = (1 - self.momentum) * self.exp_mean + self.momentum * mean.detach()  # shape: (channels,)
            self.exp_var = (1 - self.momentum) * self.exp_var + self.momentum * var.detach()  # shape: (channels,)
        else:  # eval mode
            mean = self.exp_mean  # shape: (channels,)
            var = self.exp_var  # shape: (channels,)

        x_norm = (x - mean.view(1, -1, 1)) / torch.sqrt(var + self.eps).view(1, -1, 1)  # shape:  (batch_size, channels, height * width)
        x_norm = self.gamma.view(1, -1, 1) * x_norm + self.beta.view(1, -1, 1)  # shape:  (batch_size, channels, height * width)

        return x_norm.view(batch_size, channels, height, width)
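
As a quick sanity check (a sketch, not part of the original post), the forward output in training mode can be compared against PyTorch's built-in nn.BatchNorm2d, which also normalizes with the biased variance of the current batch; note that PyTorch updates its running_var with the unbiased variance, so only the outputs, not the running buffers, are expected to agree.

x = torch.randn(4, 3, 8, 8)
custom_bn = BatchNorm(3)                              # the implementation above
torch_bn = nn.BatchNorm2d(3, eps=1e-5, momentum=0.1)  # PyTorch reference
# both modules are in training mode by default, so both use the batch statistics
print(torch.allclose(custom_bn(x), torch_bn(x), atol=1e-5))  # expected: True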

2. LayerNorm

Layer Normalization targets what can be called the external covariate shift problem in deep neural networks. A layer's input depends not only on the previous layer's output but also on the distribution of the whole dataset, which is essentially random. The input distribution of each layer therefore changes as the data changes, forcing each layer to keep re-adapting to a new input distribution; this phenomenon is referred to as external covariate shift. Because LayerNorm computes its statistics per sample rather than per batch, it does not depend on the batch dimension at all.

2.1 Mathematical Formulation

LayerNorm computes the mean and variance over all dimensions except the batch dimension. Taking image data of shape (B, C, H, W) as an example:

(Figure: LayerNorm schematic)
$$LN(X) = \gamma \frac{X - \mathbb{E}_{C,H,W}[X]}{\sqrt{\mathrm{Var}_{C,H,W}[X] + \epsilon}} + \beta$$

where:

$$\mathbb{E}_{C,H,W}[X] = \texttt{X.view(B, C, H, W).mean(dim=[1, 2, 3])}$$

$$\mathrm{Var}_{C,H,W}[X] = \mathbb{E}_{C,H,W}[X^2] - \mathbb{E}_{C,H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.

2.2 Code Implementation

Taking image data as an example, a Python implementation of LayerNorm looks like this:

import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, channels: int, height: int, width: int, eps: float = 1e-5):
        super().__init__()

        self.channels = channels
        self.height = height
        self.width = width
        self.eps = eps

        self.gamma = nn.Parameter(torch.ones(channels, height, width))
        self.beta = nn.Parameter(torch.zeros(channels, height, width))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images, shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """

        batch_size, channels, height, width = x.shape
        assert self.channels == channels and self.height == height and self.width == width

        mean = x.mean(dim=[1, 2, 3], keepdim=True)  # shape (batch_size, 1, 1, 1)
        mean_x_square = (x ** 2).mean(dim=[1, 2, 3], keepdim=True)  # shape (batch_size, 1, 1, 1)
        var = mean_x_square - mean ** 2
    
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        x_norm = self.gamma * x_norm + self.beta

        return x_norm
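
As a quick check (not part of the original post), this should agree with PyTorch's nn.LayerNorm when normalized_shape spans the (channels, height, width) dimensions, since both initialize the affine parameters to ones and zeros:

x = torch.randn(4, 3, 8, 8)
custom_ln = LayerNorm(3, 8, 8)                # the implementation above
torch_ln = nn.LayerNorm([3, 8, 8], eps=1e-5)  # PyTorch reference
print(torch.allclose(custom_ln(x), torch_ln(x), atol=1e-5))  # expected: True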

3. InstanceNorm

Instance Normalization normalizes each sample individually, which makes models better suited to tasks such as image style transfer. In style transfer, the size and content of input images vary arbitrarily, so BatchNorm and LayerNorm are poorly suited for normalizing the data. InstanceNorm, by contrast, normalizes each sample and each channel on its own and can therefore adapt to inputs of different sizes and content.

3.1 Mathematical Formulation

InstanceNorm computes the mean and variance over all dimensions except the batch and channel dimensions. Taking image data of shape (B, C, H, W) as an example:

$$IN(X) = \gamma \frac{X - \mathbb{E}_{H,W}[X]}{\sqrt{\mathrm{Var}_{H,W}[X] + \epsilon}} + \beta$$

where:

$$\mathbb{E}_{H,W}[X] = \texttt{X.view(B, C, H*W).mean(dim=[-1])}$$

$$\mathrm{Var}_{H,W}[X] = \mathbb{E}_{H,W}[X^2] - \mathbb{E}_{H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.

3.2 Code Implementation

Taking image data as an example, a Python implementation of InstanceNorm looks like this:

import torch
from torch import nn

class InstanceNorm(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()

        self.channels = channels

        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images, shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """
        batch_size, channels, height, width = x.shape
        assert self.channels == channels

        x = x.view(batch_size, self.channels, -1)  # shape: (batch_size, channels, height * width)

        mean = x.mean(dim=[-1], keepdim=True)  # shape: (batch_size, channels, 1)
        mean_x_square = (x ** 2).mean(dim=[-1], keepdim=True)  # shape: (batch_size, channels, 1)
        var = mean_x_square - mean ** 2  # shape: (batch_size, channels, 1)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)  # shape: (batch_size, channels, height * width)
        x_norm = self.gamma.view(1, -1, 1) * x_norm + self.beta.view(1, -1, 1)  # shape: (batch_size, channels, height * width)

        return x_norm.view(batch_size, channels, height, width)
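
As a quick check (not part of the original post), the output should match PyTorch's nn.InstanceNorm2d with affine=True, whose per-channel weight and bias are likewise initialized to ones and zeros:

x = torch.randn(4, 3, 8, 8)
custom_in = InstanceNorm(3)                             # the implementation above
torch_in = nn.InstanceNorm2d(3, eps=1e-5, affine=True)  # PyTorch reference
print(torch.allclose(custom_in(x), torch_in(x), atol=1e-5))  # expected: True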

4. GroupNorm

Group Normalization addresses the poor performance of BatchNorm on small batches. In BatchNorm, every feature channel is normalized with statistics shared by all samples in the mini-batch along that channel, which reduces the impact of internal covariate shift. With small batches, however, the number of samples per mini-batch is too small for the batch mean and variance to be estimated accurately, which hurts model performance. GroupNorm instead splits the channels into groups and computes the statistics per sample within each group, so it is independent of the batch size.

4.1 Mathematical Formulation

GroupNorm splits the C channels into G groups and computes the mean and variance over all dimensions except the batch and group dimensions. Taking image data of shape (B, C, H, W) as an example:

$$GN(X) = \gamma \frac{X - \mathbb{E}_{\frac{C}{G},H,W}[X]}{\sqrt{\mathrm{Var}_{\frac{C}{G},H,W}[X] + \epsilon}} + \beta$$

where

$$\mathbb{E}_{\frac{C}{G},H,W}[X] = \texttt{X.view(B, G, (C // G) * H * W).mean(dim=[-1])}$$

$$\mathrm{Var}_{\frac{C}{G},H,W}[X] = \mathbb{E}_{\frac{C}{G},H,W}[X^2] - \mathbb{E}_{\frac{C}{G},H,W}^2[X]$$

$\gamma$ and $\beta$ are learnable parameters; $\epsilon$ is a small constant that prevents division by zero.

4.2 Code Implementation

Taking image data as an example, a Python implementation of GroupNorm looks like this:

import torch
from torch import nn

class GroupNorm(nn.Module):
    def __init__(self, groups: int, channels: int, eps: float = 1e-5):
        super().__init__()

        assert channels % groups == 0, "Number of channels should be evenly divisible by the number of groups"
        self.groups = groups
        self.channels = channels

        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor):
        """
        Args:
            x (torch.Tensor): If x is batch of images with shape (batch_size, channels, height, width)

        Returns:
            torch.Tensor: shape (batch_size, channels, height, width)
        """
        batch_size, channels, height, width = x.shape
        assert self.channels == channels

        x = x.view(batch_size, self.groups, -1)  # shape: (batch_size, groups, channels // groups * height * width)

        mean = x.mean(dim=[-1], keepdim=True)  # shape: (batch_size, groups, 1)
        mean_x_square = (x ** 2).mean(dim=[-1], keepdim=True)  # shape: (batch_size, groups, 1)
        var = mean_x_square - mean ** 2  # shape: (batch_size, groups, 1)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)  # shape: (batch_size, groups, channels // groups * height * width)
        x_norm = x_norm.view(batch_size, self.channels, -1)  # shape: (batch_size, channels, height * width)
        x_norm = self.gamma.view(1, -1, 1) * x_norm + self.beta.view(1, -1, 1)  # shape: (batch_size, channels, height * width)

        return x_norm.view(batch_size, channels, height, width)
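
As a quick check (not part of the original post), the output should match PyTorch's nn.GroupNorm with the same number of groups. Note also that groups=1 makes GroupNorm compute the same statistics as the LayerNorm above (over C, H, W), while groups=channels recovers InstanceNorm.

x = torch.randn(4, 6, 8, 8)
custom_gn = GroupNorm(groups=3, channels=6)                      # the implementation above
torch_gn = nn.GroupNorm(num_groups=3, num_channels=6, eps=1e-5)  # PyTorch reference
print(torch.allclose(custom_gn(x), torch_gn(x), atol=1e-5))  # expected: True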