
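6 Regularization

6.1 Weight decay and dropout


This article mainly introduces regularization and the bias-variance decomposition, as well as the L2 regularization term in PyTorch (weight decay).


Regularization can be understood as a strategy for reducing variance.





There are two common regularization terms, L1 and L2. The L2 term is also known as weight decay.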

Without a regularization term: $Obj = Loss$, and the update is $w_{i+1} = w_{i} - \frac{\partial Obj}{\partial w_{i}} = w_{i} - \frac{\partial Loss}{\partial w_{i}}$

With the L2 regularization term: $Obj = Loss + \frac{\lambda}{2} \sum_{i}^{N} w_{i}^{2}$, and the update is $w_{i+1} = w_{i} - \frac{\partial Obj}{\partial w_{i}} = w_{i} - \left(\frac{\partial Loss}{\partial w_{i}} + \lambda w_{i}\right) = w_{i}(1-\lambda) - \frac{\partial Loss}{\partial w_{i}}$, where $0 < \lambda < 1$. The factor $(1-\lambda)$ shrinks the weight at every step, which is why the term is called weight decay.
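To make the effect of the $(1-\lambda)$ factor concrete, here is a minimal sketch in plain Python. It assumes the loss gradient is zero so that only the regularization term acts; the names `decay_only_updates`, `w0` and `lam` are illustrative and do not come from the original code.

```python
def decay_only_updates(w0, lam, steps):
    """Apply w <- w * (1 - lam) - grad_loss repeatedly with a zero loss gradient."""
    history = [w0]
    w = w0
    for _ in range(steps):
        grad_loss = 0.0                  # assumption: the data term contributes nothing
        w = w * (1 - lam) - grad_loss    # update rule from the formula above
        history.append(w)
    return history


weights = decay_only_updates(w0=1.0, lam=0.1, steps=3)
print(weights)  # each entry is 0.9 times the previous one
```

The formula above leaves out the learning rate. In an actual SGD step the whole gradient is multiplied by the learning rate, so the shrink factor becomes $(1 - lr \cdot \lambda)$.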

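In PyTorch, the L2 regularization term is implemented in the optimizer. The `weight_decay` argument passed when constructing the optimizer corresponds to $\lambda$ in the formula.

The code below compares training with an optimizer without weight decay and one with weight decay set to 0.01. The experiment uses a linear regression dataset and a fully connected network with 3 hidden layers, and TensorBoard is used to visualize how the weights of each layer change. The code is as follows: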

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from common_tools import set_seed
from tensorboardX import SummaryWriter

set_seed(1)  # set the random seed
n_hidden = 200
max_iter = 2000
disp_interval = 200
lr_init = 0.01

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")

for epoch in range(max_iter):
    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    if (epoch+1) % disp_interval == 0:
        # visualization: log gradients and weights of both networks
        for name, layer in net_normal.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_weight_decay.named_parameters():
            writer.add_histogram(name + '_grad_weight_decay', layer.grad, epoch)
            writer.add_histogram(name + '_data_weight_decay', layer, epoch)

        test_pred_normal, test_pred_wdecay = net_normal(test_x), net_weight_decay(test_x)

        # plotting (the plotted arrays were lost in the source and are restored from the labels)
        plt.scatter(train_x.numpy(), train_y.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.numpy(), test_y.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.numpy(), test_pred_normal.detach().numpy(), 'r-', lw=3, label='no weight decay')
        plt.plot(test_x.numpy(), test_pred_wdecay.detach().numpy(), 'b--', lw=3, label='weight decay')
        plt.text(-0.25, -1.5, 'no weight decay loss={:.6f}'.format(loss_normal.item()), fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'weight decay loss={:.6f}'.format(loss_wdecay.item()), fontdict={'size': 15, 'color': 'red'})

        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch+1))
        plt.show()
        plt.close()

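After 2000 epochs of training, the fitted models look as follows:

The model trained with weight decay has a higher loss on the training set, but its curve is smoother and it generalizes better.

Next comes the analysis based on the TensorBoard visualization. First look at the weights without weight decay. The weights of the first layer change as follows:


Then look at the weights with weight decay. The weights of the first layer change as follows:

With weight decay, as the number of training iterations increases, the weight distribution gradually concentrates around a mean of 0. This is the effect of L2 regularization: it constrains the weights to stay close to 0.

The weights of the second layer without weight decay change as follows:

The weights of the second layer with weight decay change as follows:


Implementation of weight decay in the optimizer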

Because weight decay is a parameter of the optimizer, the gradient including the weight decay term is computed when `optim_wdecay.step()` is executed. The relevant code is:

def step(self, closure=None):
    """Performs a single optimization step.

    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                d_p.add_(weight_decay, p.data)
            ...
            ...
            ...
            p.data.add_(-group['lr'], d_p)

Here `d_p` is the computed gradient. If weight decay is not 0, it is updated as `d_p = d_p + weight_decay * p.data`, which corresponds to $\left(\frac{\partial Loss}{\partial w_{i}} + \lambda w_{i}\right)$ in the formula. The last line updates the weights using this gradient.
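The update described above can be checked against a hand-written version. The following is a minimal sketch, assuming plain SGD without momentum and a toy quadratic loss; the variable names `d_p` and `expected` are chosen for illustration.

```python
import torch

torch.manual_seed(0)
lr, weight_decay = 0.1, 0.01

# A single parameter tensor with a simple quadratic loss.
p = torch.nn.Parameter(torch.randn(5))
optimizer = torch.optim.SGD([p], lr=lr, weight_decay=weight_decay)

loss = (p ** 2).sum()
optimizer.zero_grad()
loss.backward()

# Manual version of the update, computed before the optimizer modifies p.
with torch.no_grad():
    d_p = p.grad + weight_decay * p      # gradient plus the L2 term
    expected = p - lr * d_p              # plain SGD step (no momentum)

optimizer.step()
print(torch.allclose(p.detach(), expected))
```

With momentum enabled the weight decay term is still added to the gradient first, and the momentum buffer is then built from that combined gradient.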


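Dropout is another method for suppressing overfitting. Dropout changes the scale of the data: with dropout_prob = 0.3, only about 70% of the activations survive during training, so the data scale drops to 70% of the original. At test time, after model.eval(), dropout is switched off. In the original formulation of dropout, all weights therefore have to be multiplied by (1-dropout_prob) at test time so that the data scale is also reduced to 70%. PyTorch instead rescales during training, as described in the section on model.eval() and model.train() below.

The Dropout layer in PyTorch is shown below. It is usually placed in front of the network layer it applies to:

torch.nn.Dropout(p=0.5, inplace=False)


p: note that p is the probability of an element being dropped, also called the deactivation probability (not the probability of keeping it)

The experiment below again uses the linear regression example. Both networks have 3 fully connected hidden layers with dropout in front of each following layer. One network uses a dropout probability of 0 and the other 0.5, and TensorBoard is used to visualize how the weights of each layer change. The code is as follows: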

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from common_tools import set_seed
from tensorboardX import SummaryWriter

set_seed(1)  # set the random seed
n_hidden = 200
max_iter = 2000
disp_interval = 400
lr_init = 0.01

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_prob_0.parameters(), lr=lr_init, momentum=0.9)
optim_reglar = torch.optim.SGD(net_prob_05.parameters(), lr=lr_init, momentum=0.9)

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")

for epoch in range(max_iter):
    pred_normal, pred_wdecay = net_prob_0(train_x), net_prob_05(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_reglar.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_reglar.step()

    if (epoch+1) % disp_interval == 0:
        # switch to eval mode so that dropout is disabled for the test predictions
        net_prob_0.eval()
        net_prob_05.eval()

        # visualization: log gradients and weights of both networks
        for name, layer in net_prob_0.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_prob_05.named_parameters():
            writer.add_histogram(name + '_grad_regularization', layer.grad, epoch)
            writer.add_histogram(name + '_data_regularization', layer, epoch)

        test_pred_prob_0, test_pred_prob_05 = net_prob_0(test_x), net_prob_05(test_x)

        # plotting (the plotted arrays were lost in the source and are restored from the labels)
        plt.scatter(train_x.numpy(), train_y.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.numpy(), test_y.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.numpy(), test_pred_prob_0.detach().numpy(), 'r-', lw=3, label='d_prob_0')
        plt.plot(test_x.numpy(), test_pred_prob_05.detach().numpy(), 'b--', lw=3, label='d_prob_05')
        plt.text(-0.25, -1.5, 'd_prob_0 loss={:.8f}'.format(loss_normal.item()), fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'd_prob_05 loss={:.6f}'.format(loss_wdecay.item()), fontdict={'size': 15, 'color': 'red'})

        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch+1))
        plt.show()
        plt.close()

        # switch back to training mode
        net_prob_0.train()
        net_prob_05.train()

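After 2000 training iterations, the fitted curves look as follows:

We use TensorBoard to inspect how the weights of the third layer change.

The weights with dropout = 0 change as follows:

The weights with dropout = 0.5 change as follows:

With dropout, the weights are more concentrated around 0, so the neurons do not depend too heavily on one another.

model.eval() and model.train()

Some layers behave differently in training and in testing. The dropout layer is one of them: it is active during training, which changes the scale of the data, so to keep the scale unchanged the surviving activations are divided by 1-p. At test time the dropout layer is switched off. Therefore call model.eval() before testing to set the training attribute of every layer to False, and call model.train() before training to set it back to True.

The following compares the output of a dropout layer in eval and train mode.

First build a single fully connected layer with 10000 input neurons and 1 output neuron, set all weights to 1, and set dropout to 0.5. The input is a vector of ones. The network output is then evaluated in train mode and in eval mode: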

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1, bias=False),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.linears(x)

input_num = 10000
x = torch.ones((input_num, ), dtype=torch.float32)

net = Net(input_num, d_prob=0.5)
# set all weights to 1 (this line was lost in the source and is restored from the description)
net.linears[1].weight.detach().fill_(1.)

net.train()
y = net(x)
print("output in training mode", y)

net.eval()
y = net(x)
print("output in eval mode", y)

The output is:

output in training mode tensor([9868.], grad_fn=<...>)
output in eval mode tensor([10000.], grad_fn=<...>)

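During training, with a dropout probability of 0.5 about half of the inputs are zeroed, so without any correction the output would be around 5000. However, during training the dropout layer divides the surviving activations by 1-p = 0.5, that is, multiplies them by 2, so the output in train mode is a number close to 10000 (the fluctuation comes from the randomness of the mask). In eval mode dropout is switched off, so the output is exactly 10000. Because the scaling is done during training, no scaling is needed at test time, which speeds up inference.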
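The scaling described above can be written out by hand. The sketch below is a hypothetical `inverted_dropout` helper, not PyTorch's internal implementation; it assumes an element-wise Bernoulli mask and rescales the survivors by 1 / (1 - p).

```python
import torch


def inverted_dropout(x, p, training=True):
    """Zero each element with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                                     # eval mode: identity, no rescaling
    mask = (torch.rand_like(x) >= p).to(x.dtype)     # 1 = keep, 0 = drop
    return x * mask / (1.0 - p)


torch.manual_seed(0)
x = torch.ones(10000)
train_out = inverted_dropout(x, p=0.5, training=True)
eval_out = inverted_dropout(x, p=0.5, training=False)

print(train_out.sum().item())   # close to 10000, varies with the random mask
print(eval_out.sum().item())    # the input is returned unchanged
```

Each surviving element becomes 2.0 and each dropped element 0.0, so the expected sum stays at 10000 even though roughly half of the entries are zero.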

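6.2 Normalization


This article mainly introduces the concept of Batch Normalization and the 1d/2d/3d Batch Normalization implementations in PyTorch.

Batch Normalization

Batch Normalization standardizes over a batch. "Batch" refers to a batch of data, usually a mini-batch. "Normalization" means that the processed data has mean 0 and variance 1, loosely written as $N(0,1)$.

With Batch Normalization:

Dropout can be omitted or used with a smaller probability

L2 regularization can be omitted or used with a smaller weight decay

LRN (local response normalization) can be omitted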

Suppose the input mini-batch is $\mathcal{B}=\left\{x_{1 \dots m}\right\}$ and the learnable parameters of Batch Normalization are $\gamma, \beta$. The steps are:

Compute the mini-batch mean: $\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_{i}$

Compute the mini-batch variance: $\sigma_{\mathcal{B}}^{2} \leftarrow \frac{1}{m} \sum_{i=1}^{m}\left(x_{i}-\mu_{\mathcal{B}}\right)^{2}$

Standardize: $\widehat{x}_{i} \leftarrow \frac{x_{i}-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}$, where $\epsilon$ is a small constant that prevents the denominator from being 0

Affine transform (scale and shift): $y_{i} \leftarrow \gamma \widehat{x}_{i}+\beta \equiv \mathrm{BN}_{\gamma, \beta}\left(x_{i}\right)$. This step increases the capacity of the model: the model can decide for itself whether to standardize the data and to what degree. If $\gamma=\sqrt{\sigma_{\mathcal{B}}^{2}}$ and $\beta=\mu_{\mathcal{B}}$, the layer implements the identity mapping.
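The four steps above can be sketched directly with tensor operations and compared with `nn.BatchNorm1d` in training mode. This is a minimal illustration with random input; the variable names (`mu`, `var`, `x_hat`, `y_manual`) are chosen here and are not from the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 4)                       # mini-batch of m = 16 samples, 4 features
bn = nn.BatchNorm1d(num_features=4)
bn.train()                                   # training mode uses batch statistics

mu = x.mean(dim=0)                           # step 1: per-feature mean
var = x.var(dim=0, unbiased=False)           # step 2: per-feature (biased) variance
x_hat = (x - mu) / torch.sqrt(var + bn.eps)  # step 3: standardize
gamma, beta = bn.weight.detach(), bn.bias.detach()  # learnable parameters of the layer
y_manual = gamma * x_hat + beta              # step 4: affine transform

y_layer = bn(x).detach()
print(torch.allclose(y_manual, y_layer, atol=1e-5))
```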

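Batch Normalization was proposed mainly to address Internal Covariate Shift (ICS). During training the data passes through many layers. If its scale changes during forward propagation, gradients may explode or vanish, which makes the model hard to converge.

The Batch Normalization layer is usually placed right before the activation function.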


import torch
import numpy as np
import torch.nn as nn
from common_tools import set_seed

set_seed(1)  # set the random seed

class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            # x = bn(x)   # uncomment to insert a bn layer before each activation
            x = torch.relu(x)

            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

            print("layers:{}, std:{}".format(i, x.std().item()))

        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # method 1
                # nn.init.normal_(m.weight.data, std=1)    # normal: mean=0, std=1

                # method 2 kaiming
                nn.init.kaiming_normal_(m.weight.data)

neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums)
# net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)


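After initializing with nn.init.kaiming_normal_(), the standard deviation of the data stays within [0.6, 0.9].

When we skip weight initialization and instead insert a bn layer before each activation function, the standard deviation of the data stays within [0.58, 0.59]. So with Batch Normalization, careful weight initialization is not required.

Next we take the LeNet from the RMB binary classification experiment as an example, add bn layers, and compare the training process of the network with and without bn layers.

The network without bn layers, using kaiming initialization, trains as follows:

During training the training loss spikes to 1.4 in the middle, so training is not very stable.

The LeNet with bn layers is defined as follows: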

import torch.nn as nn
import torch.nn.functional as F

class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)

        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        # each bn layer sits between the conv/linear layer and its activation
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = out.view(out.size(0), -1)

        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

The network with bn layers, without kaiming initialization, trains as follows:

The training loss still spikes during training, but only up to 0.4, so training is considerably more stable.

Batch Normalization in PyTorch

PyTorch provides 3 Batch Normalization classes:

nn.BatchNorm1d(), for input of shape $B \times C \times \text{1D feature}$

nn.BatchNorm2d(), for input of shape $B \times C \times \text{2D feature}$

nn.BatchNorm3d(), for input of shape $B \times C \times \text{3D feature}$


torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)





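affine: whether to apply the affine transform, True by default

track_running_stats: whether to keep running estimates of the mean and variance, True by default. When True, the running estimates are updated from every mini-batch in training mode and are used, fixed, for normalization in eval mode. When False, no running estimates are kept and batch statistics are always used. Switching between training and testing is done with train() and eval(), not with this argument.




weight: $\gamma$ in the affine transform

bias: $\beta$ in the affine transform

During training, the running mean and variance are computed as an exponentially weighted average: they take into account not only the mean and variance of the current mini-batch but also those of the previous mini-batches.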
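As a minimal sketch of how the running statistics are built and then used, the code below runs one training pass through `nn.BatchNorm1d` and checks the result by hand. The random input and the names `expected_mean`, `expected_var` and `y_manual` are assumptions made for illustration. PyTorch updates `running_var` with the unbiased batch variance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
momentum = 0.3
bn = nn.BatchNorm1d(num_features=3, momentum=momentum)
x = torch.randn(8, 3) * 2.0 + 5.0

bn.train()
bn(x)                                        # one training pass updates the running statistics

# Exponentially weighted update, starting from the initial values 0 and 1.
expected_mean = (1 - momentum) * torch.zeros(3) + momentum * x.mean(dim=0)
expected_var = (1 - momentum) * torch.ones(3) + momentum * x.var(dim=0, unbiased=True)
print(torch.allclose(bn.running_mean, expected_mean, atol=1e-5))
print(torch.allclose(bn.running_var, expected_var, atol=1e-5))

bn.eval()
y_eval = bn(x).detach()                      # eval mode: running statistics, not batch statistics
y_manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
y_manual = y_manual * bn.weight.detach() + bn.bias.detach()
print(torch.allclose(y_eval, y_manual, atol=1e-5))
```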


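All bn layers compute the 4 attributes above (running_mean, running_var, weight, bias) per feature dimension. See the examples below for details.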


The input has shape $B \times C \times \text{1D feature}$. In the example below the data has shape (3, 5, 1): a mini-batch of 3 samples, each with 5 features, each feature of dimension 1. So 5 means and variances are computed, one per feature. momentum is set to 0.3, the initial running mean and variance default to 0 and 1, and the same mini-batch is fed in twice.

import torch
import torch.nn as nn

batch_size = 3
num_features = 5
momentum = 0.3

features_shape = (1,)

feature_map = torch.ones(features_shape)                                                    # 1D
feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)         # 2D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)             # 3D
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1
mean_t, var_t = 2, 0    # batch mean and variance of the second feature

for i in range(2):
    outputs = bn(feature_maps_bs)

    print("\niteration:{}, running mean: {} ".format(i, bn.running_mean))
    print("iteration:{}, running var:{} ".format(i, bn.running_var))

    # reproduce the update of the second feature by hand
    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    running_var = (1 - momentum) * running_var + momentum * var_t

    print("iteration:{}, running mean of the second feature: {} ".format(i, running_mean))
    print("iteration:{}, running var of the second feature:{}".format(i, running_var))

The output is:

input data:
tensor([[[1.],
         [2.],
         [3.],
         [4.],
         [5.]],

        [[1.],
         [2.],
         [3.],
         [4.],
         [5.]],

        [[1.],
         [2.],
         [3.],
         [4.],
         [5.]]]) shape is torch.Size([3, 5, 1])

iteration:0, running mean: tensor([0.3000, 0.6000, 0.9000, 1.2000, 1.5000])
iteration:0, running var:tensor([0.7000, 0.7000, 0.7000, 0.7000, 0.7000])
iteration:0, running mean of the second feature: 0.6
iteration:0, running var of the second feature:0.7

iteration:1, running mean: tensor([0.5100, 1.0200, 1.5300, 2.0400, 2.5500])
iteration:1, running var:tensor([0.4900, 0.4900, 0.4900, 0.4900, 0.4900])
iteration:1, running mean of the second feature: 1.02
iteration:1, running var of the second feature:0.48999999999999994

Although the two mini-batches contain the same data, the running mean and variance of the bn layer differ between the two passes. Take the running mean of the second feature as an example, where every value is 2.

Running mean after the first pass: $\text{running\_mean} = (1-\text{momentum}) \times \text{pre\_running\_mean} + \text{momentum} \times \text{mean}_t = (1-0.3) \times 0 + 0.3 \times 2 = 0.6$

Running mean after the second pass: $\text{running\_mean} = (1-\text{momentum}) \times \text{pre\_running\_mean} + \text{momentum} \times \text{mean}_t = (1-0.3) \times 0.6 + 0.3 \times 2 = 1.02$

Before the network performs any forward pass, inspecting the bn layer at a breakpoint shows the following attributes:



The input has shape $B \times C \times \text{2D feature}$. In the example below the data has shape (3, 3, 2, 2): a mini-batch of 3 samples, each with 3 features, each feature of dimension $2 \times 2$. So 3 means and variances are computed, one per feature. momentum is set to 0.3, the initial running mean and variance default to 0 and 1, and the same mini-batch is fed in twice.

import torch
import torch.nn as nn

batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)                                                    # 2D
feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)         # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)             # 4D

# print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm2d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    outputs = bn(feature_maps_bs)

    print("\niter:{}, running_mean: {}".format(i, bn.running_mean))
    print("iter:{}, running_var: {}".format(i, bn.running_var))

    # the arguments of the next two calls were truncated in the source; restored from the printed output
    print("iter:{}, weight: {}".format(i, bn.weight.data.numpy()))
    print("iter:{}, bias: {}".format(i, bn.bias.data.numpy()))

The output is:

iter:0, running_mean: tensor([0.3000, 0.6000, 0.9000])
iter:0, running_var: tensor([0.7000, 0.7000, 0.7000])
iter:0, weight: [1. 1. 1.]
iter:0, bias: [0. 0. 0.]

iter:1, running_mean: tensor([0.5100, 1.0200, 1.5300])
iter:1, running_var: tensor([0.4900, 0.4900, 0.4900])
iter:1, weight: [1. 1. 1.]
iter:1, bias: [0. 0. 0.]


The input has shape $B \times C \times \text{3D feature}$. In the example below the data has shape (3, 3, 2, 2, 3): a mini-batch of 3 samples, each with 3 features, each feature of dimension $2 \times 2 \times 3$. So 3 means and variances are computed, one per feature. momentum is set to 0.3, the initial running mean and variance default to 0 and 1, and the same mini-batch is fed in twice.

import torch
import torch.nn as nn

batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2, 3)

feature = torch.ones(features_shape)                                                # 3D
feature_map = torch.stack([feature * (i + 1) for i in range(num_features)], dim=0)  # 4D
feature_maps = torch.stack([feature_map for i in range(batch_size)], dim=0)         # 5D

# print("input data:\n{} shape is {}".format(feature_maps, feature_maps.shape))

bn = nn.BatchNorm3d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    outputs = bn(feature_maps)

    print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
    print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))

    print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
    print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))

The output is:

iter:0, running_mean.shape: torch.Size([3])
iter:0, running_var.shape: torch.Size([3])
iter:0, weight.shape: torch.Size([3])
iter:0, bias.shape: torch.Size([3])

iter:1, running_mean.shape: torch.Size([3])
iter:1, running_var.shape: torch.Size([3])
iter:1, weight.shape: torch.Size([3])
iter:1, bias.shape: torch.Size([3])

Layer Normalization

Motivation: Batch Normalization is not suitable for variable-length networks such as RNNs.



There are no running_mean and running_var

$\gamma$ and $\beta$ are elementwise


torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)


normalized_shape: the shape of the features to normalize over in this layer, which can be $C \times H \times W$, $H \times W$, or $W$


elementwise_affine: whether to apply an elementwise affine transform

In the code below the input has shape $B \times C \times \text{feature}$, here (8, 2, 3, 4): a mini-batch of 8 samples, each with 2 features, each feature of dimension $3 \times 4$. So 8 means and variances are computed, one per sample.

import torch
import torch.nn as nn

batch_size = 8
num_features = 2

features_shape = (3, 4)

feature_map = torch.ones(features_shape)  # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

# feature_maps_bs shape is [8, 2, 3, 4],  B * C * H * W
# ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=True)
# ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=False)
# ln = nn.LayerNorm([6, 3, 4])   # would not match this input, which has 2 channels
ln = nn.LayerNorm([2, 3, 4])

output = ln(feature_maps_bs)

print("Layer Normalization")
print(ln.weight.shape)
print(feature_maps_bs[0, ...])
print(output[0, ...])

The output is:

Layer Normalization
torch.Size([2, 3, 4])
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[2., 2., 2., 2.],
         [2., 2., 2., 2.],
         [2., 2., 2., 2.]]])
tensor([[[-1.0000, -1.0000, -1.0000, -1.0000],
         [-1.0000, -1.0000, -1.0000, -1.0000],
         [-1.0000, -1.0000, -1.0000, -1.0000]],

        [[ 1.0000,  1.0000,  1.0000,  1.0000],
         [ 1.0000,  1.0000,  1.0000,  1.0000],
         [ 1.0000,  1.0000,  1.0000,  1.0000]]], grad_fn=<...>)

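For this input, normalized_shape can also be set to (3, 4) or (4), in which case the statistics are computed over only the last two dimensions or the last dimension.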
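The per-sample statistics can be reproduced by hand. The following is a minimal sketch with random input, assuming the default `nn.LayerNorm` initialization (weight 1, bias 0); `mean`, `var` and `y_manual` are names chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 2, 3, 4)                  # B x C x H x W
ln = nn.LayerNorm([2, 3, 4])

# Normalize each sample over its own C, H, W elements.
mean = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + ln.eps)

y_layer = ln(x).detach()
print(mean.numel())                          # one mean per sample
print(torch.allclose(y_layer, y_manual, atol=1e-5))
```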

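Instance Normalization

Motivation: Batch Normalization is not suitable for image generation. The images in a mini-batch have different styles, so the data in the batch cannot be treated as one class and standardized together.

Idea: compute the mean and variance per channel of each instance, that is, one mean and variance per feature map.

The classes are InstanceNorm1d, InstanceNorm2d, and InstanceNorm3d.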
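The idea of one mean and variance per feature map can be sketched by hand and compared with `nn.InstanceNorm2d`. This is a minimal illustration with random input and the default `affine=False`; the names `mean`, `var` and `y_manual` are chosen here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(3, 3, 2, 2)                  # B x C x H x W
inorm = nn.InstanceNorm2d(num_features=3)

# One mean and variance per (sample, channel) pair, computed over H and W.
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + inorm.eps)

y_layer = inorm(x)
print(mean.numel())                          # B * C statistics
print(torch.allclose(y_layer, y_manual, atol=1e-5))
```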


torch.nn.InstanceNorm1d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)





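affine: whether to apply the affine transform, False by default

track_running_stats: whether to keep running estimates of the mean and variance, False by default. When False, the statistics of each instance are used in both training and eval mode. When True, running estimates are updated in training mode and used, fixed, in eval mode.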

In the code below the input has shape $B \times C \times \text{2D feature}$, here (3, 3, 2, 2): a mini-batch of 3 samples, each with 3 features, each feature of dimension $2 \times 2$. So $3 \times 3$ means and variances are computed, one per feature of each sample. Because every feature map in this example is constant, each one is normalized to all zeros.

import torch
import torch.nn as nn

batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)    # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

print("Instance Normalization")
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

instance_n = nn.InstanceNorm2d(num_features=num_features, momentum=momentum)

for i in range(1):
    outputs = instance_n(feature_maps_bs)

    print(outputs)

The output is:

Instance Normalization
input data:
tensor([[[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]],


        [[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]],


        [[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]]]) shape is torch.Size([3, 3, 2, 2])
tensor([[[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]]]])

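Group Normalization

Motivation: with a small batch, the statistics estimated by Batch Normalization are inaccurate. Group Normalization is typically used in very large models, where the batch size has to be small.

Idea: when there is not enough data, use channels to make up for it. The features of each sample are split into several groups, and a mean and variance are computed for each group. It can be seen as Layer Normalization with feature grouping added.


There are no running_mean and running_var

$\gamma$ and $\beta$ are per channel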
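The grouping can be sketched by reshaping the channels into groups and normalizing each group, then comparing with `nn.GroupNorm`. This is a minimal illustration with random input, assuming the default affine initialization (weight 1, bias 0); `grouped`, `mean`, `var` and `y_manual` are names chosen here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 3, 3)                  # B x C x H x W
num_groups = 2
gn = nn.GroupNorm(num_groups, 4)

B, C, H, W = x.shape
grouped = x.reshape(B, num_groups, -1)       # (B, G, C/G * H * W): consecutive channels share a group
mean = grouped.mean(dim=2, keepdim=True)
var = grouped.var(dim=2, unbiased=False, keepdim=True)
y_manual = ((grouped - mean) / torch.sqrt(var + gn.eps)).reshape(B, C, H, W)

y_layer = gn(x).detach()
print(mean.numel())                          # B * G statistics
print(gn.weight.shape)                       # gamma has one entry per channel
print(torch.allclose(y_layer, y_manual, atol=1e-5))
```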


torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True)



num_channels: the number of features (channels). Note that num_channels must be divisible by num_groups


affine: whether to apply the affine transform

In the code below the input has shape $B \times C \times \text{2D feature}$, here (2, 4, 2, 2): a mini-batch of 2 samples, each with 4 features, each feature of dimension $2 \times 2$. num_groups is set to 2, so each group contains $4 \div 2 = 2$ features and $2 \times 2 = 4$ means and variances are computed, one per group of each sample.

import torch
import torch.nn as nn

batch_size = 2
num_features = 4
num_groups = 2

features_shape = (2, 2)

feature_map = torch.ones(features_shape)    # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps * (i + 1) for i in range(batch_size)], dim=0)  # 4D

gn = nn.GroupNorm(num_groups, num_features)
outputs = gn(feature_maps_bs)

print("Group Normalization")
print(gn.weight.shape)
print(outputs[0])

The output is:

Group Normalization
torch.Size([4])
tensor([[[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]],

        [[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]]], grad_fn=<...>)

Below is an example of L1 regularization for a CNN implemented in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5)
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 32 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# `trainloader` is assumed to be a DataLoader of 3 x 32 x 32 images defined elsewhere.
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data

        # L1 regularization term: sum of the L1 norms of all parameters.
        # A float tensor is required so that the in-place addition works.
        l1_regularization = torch.tensor(0.)
        for param in model.parameters():
            l1_regularization += torch.norm(param, 1)
        l1_regularization *= 0.001

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels) + l1_regularization
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 2000 == 1999:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
```

In the code above, `torch.norm` computes the L1 norm of each parameter. The sum is multiplied by a small coefficient and added to the loss as the L1 regularization term. Adjusting this coefficient controls the strength of the regularization.


