What Is Batch Normalization?
Normalization is a data-preprocessing tool that brings numerical data onto a common scale without distorting its shape.
Generally, when we feed data to a machine-learning or deep-learning algorithm, we rescale the values to a balanced range. Normalization ensures that the model can generalize properly from the data.
Coming back to Batch Normalization: it is a technique that makes a deep neural network faster and more stable by adding extra layers to it. The new layer performs standardization and normalization on the input it receives from the previous layer.
So what does the term "Batch" in Batch Normalization refer to? A typical neural network is trained on a set of input examples called a batch. Likewise, the normalization in Batch Normalization is computed over a batch rather than over a single input.
For example, consider a deep neural network like the one in the figure below.
L = number of layers
Bias = 0
Activation function: Sigmoid
The inputs X1, X2, X3, X4 are already in normalized form, since they come from the preprocessing stage. When the inputs pass through the first layer, the input X and the weight matrix W are combined by a dot product and the result is passed through the sigmoid function, and so on for the subsequent layers:
h1 = σ(W·X)
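A minimal sketch of this forward step (the shapes here are illustrative, not taken from the figure):

```python
import torch

torch.manual_seed(0)

# Illustrative shapes: 4 normalized inputs, 3 neurons in the first layer, bias = 0.
X = torch.randn(4)     # inputs X1..X4 after preprocessing
W = torch.randn(3, 4)  # weight matrix of the first layer

# h1 = sigmoid(W . X): dot product followed by the sigmoid activation
h1 = torch.sigmoid(W @ X)
print(h1.shape)  # torch.Size([3])
```

Each entry of h1 lies in (0, 1), since the sigmoid squashes its input into that interval.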
As the data passes through the many layers of the network and through L activation functions, an internal covariate shift occurs. During the training of a deep network, the distribution of the internal nodes' activations changes as the network's parameters change; this phenomenon is called Internal Covariate Shift.
Normalizing the Inputs
Normalization transforms the data to zero mean and unit standard deviation. In this step, we take the batch of inputs to layer h and first compute the mean of these hidden activations:
μ = (1/m) Σ hi
Here, m is the number of neurons at layer h. The next step is to compute the standard deviation of the hidden activations:
σ = sqrt((1/m) Σ (hi − μ)²)
Finally, the mean is subtracted from each input, and the result is divided by the sum of the standard deviation and a smoothing term ε:
hi(norm) = (hi − μ) / (σ + ε)
The smoothing term ε is a very small constant that guards against a zero denominator and keeps the computation numerically stable.
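The steps above can be sketched numerically. In practice (and in the implementation below), the statistics are computed per neuron across the batch; the values here are made up for illustration:

```python
import torch

# A toy batch of hidden activations: batch of 4 examples, two neurons.
h = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])
eps = 1e-3  # smoothing term

mu = h.mean(dim=0)                    # per-neuron batch mean
sigma = h.std(dim=0, unbiased=False)  # per-neuron batch standard deviation
h_norm = (h - mu) / (sigma + eps)     # subtract the mean, divide by (sigma + eps)

print(h_norm.mean(dim=0))  # ~0 for each neuron
```

After this step each neuron's activations have (approximately) zero mean and unit standard deviation across the batch, up to the small distortion introduced by ε.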
Rescaling and Shifting
In the final operation, the input is rescaled and shifted using the rescaling parameter γ (gamma) and the shifting parameter β (beta):
yi = γ · hi(norm) + β
After normalization, the activations of each neuron follow a standard normal distribution across the batch. The two trainable parameters γ and β then apply a linear transformation to compute the layer's output; this step lets the model choose the best distribution for each hidden layer by adjusting these two parameters:
- γ adjusts the standard deviation;
- β adjusts the bias, shifting the curve to the right or to the left.
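A minimal sketch of the rescale-and-shift step, using made-up normalized activations for two neurons:

```python
import torch

# Normalized activations from the previous step (two neurons, illustrative values).
h_norm = torch.tensor([[-1.0, 0.5],
                       [ 1.0, -0.5]])
gamma = torch.nn.Parameter(torch.ones(2))   # scale, initialized to 1
beta = torch.nn.Parameter(torch.zeros(2))   # shift, initialized to 0

# y = gamma * h_norm + beta: with the default initialization this is the identity
y = gamma * h_norm + beta
print(torch.equal(y.detach(), h_norm))  # True
```

Because γ starts at 1 and β at 0, the layer initially passes the normalized activations through unchanged; training then moves γ and β to whatever distribution suits each hidden layer.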
A Simple Batch Normalization Implementation in PyTorch
- Implement BN in PyTorch and apply it to an MLP with three hidden layers;
- Test it on the MNIST dataset (a small dataset).
Importing Libraries and Setting Hyperparameters
# Import libraries
import numpy as np
import torch
from torch import nn
import torchvision
import torchvision.datasets as datasets
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Random seeds
torch.manual_seed(0)
np.random.seed(0)
# Hyperparameters
# Both learning rates are identical
LR_BASE = 0.01 # lr, baseline network
LR_BN = 0.01 # lr, network with BN
num_iterations = 10000 # 50000
valid_steps = 50 # training iterations before validation
verbose = True
Importing the MNIST Dataset
# Load the MNIST dataset
transform_imgs = torchvision.transforms.Compose([torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize( # Input normalization is not mentioned in the paper;
(0.1307,), (0.3081,))]) # skipping it emphasizes the smoothing effect of BN on
# hidden-layer activations.
mnist_trainset = datasets.MNIST(root='./data',
train=True,
download=True,
transform=transform_imgs)
mnist_testset = datasets.MNIST(root='./data',
train=False,
download=True,
transform=transform_imgs)
# Dataset loader
train_loader_params = {'shuffle': True, 'batch_size' : 60, 'drop_last' : False}
test_loader_params = {'shuffle': False, 'batch_size' : 60, 'drop_last' : False}
train_loader = DataLoader(mnist_trainset, **train_loader_params )
test_loader = DataLoader(mnist_testset, **test_loader_params )
# Each batch: (imgs, targets)
# Images : (batch_size=60, channels=1, height=28, width=28) tensor (float32)
# Targets : (batch_size=60) tensor (int64)
Implementing BN
# BatchNorm implementation
class myBatchNorm2d(nn.Module):
def __init__(self, input_size = None , epsilon = 1e-3, momentum = 0.99):
super(myBatchNorm2d, self).__init__()
assert input_size is not None, 'Missing input_size parameter.'
# Running batch mean and variance estimated during training
self.mu = torch.zeros(1, input_size)
# Variance is the mean of the squared deviations from the mean, i.e. var = mean(abs(x-x.mean())**2)
self.var = torch.ones(1, input_size)
# Small constant for numerical stability
self.epsilon = epsilon
# Exponential moving average for mu & var update
self.it_call = 0 # training iterations
self.momentum = momentum # EMA smoothing
# Trainable parameters
self.beta = torch.nn.Parameter(torch.zeros(1, input_size))
self.gamma = torch.nn.Parameter(torch.ones(1, input_size))
# Batch size on which the normalization is computed
self.batch_size = 0
def forward(self, x):
# [batch_size, input_size]
self.it_call += 1
if self.training :
if( self.batch_size == 0 ):
# First iteration : save batch_size
self.batch_size = x.shape[0]
# Training : compute BN pass
batch_mu = (x.sum(dim=0)/x.shape[0]).unsqueeze(0) # [1, input_size]
batch_var = x.var(dim=0, unbiased=False).unsqueeze(0) # [1, input_size]
x_normalized = (x-batch_mu)/torch.sqrt(batch_var + self.epsilon) # [batch_size, input_size]
x_bn = self.gamma * x_normalized + self.beta # [batch_size, input_size]
# Update running mu & var
if(x.shape[0] == self.batch_size):
running_mu = batch_mu
running_var = batch_var
else:
running_mu = batch_mu*self.batch_size/x.shape[0]
running_var = batch_var*self.batch_size/x.shape[0]
self.mu = running_mu * (self.momentum/self.it_call) + \
self.mu * (1 - (self.momentum/self.it_call))
self.var = running_var * (self.momentum/self.it_call) + \
self.var * (1 - (self.momentum/self.it_call))
else:
# Inference: compute BN pass using estimated mu & var
if (x.shape[0] == self.batch_size):
estimated_mu = self.mu
estimated_var = self.var
else :
estimated_mu = self.mu*x.shape[0]/self.batch_size
estimated_var = self.var*x.shape[0]/self.batch_size
x_normalized = (x-estimated_mu)/torch.sqrt(estimated_var + self.epsilon) # [batch_size, input_size]
x_bn = self.gamma * x_normalized + self.beta # [batch_size, input_size]
return x_bn # [batch_size, output_size=input_size]
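For reference, PyTorch ships an equivalent built-in layer, nn.BatchNorm1d, which performs the same training-time normalization and also maintains running estimates for inference. This quick sanity check (not part of the original walkthrough) compares it to a manual normalization of one batch:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Built-in layer with the same epsilon as the custom class above
bn = nn.BatchNorm1d(100, eps=1e-3)
bn.train()

x = torch.randn(60, 100)
out = bn(x)

# Manual normalization of the same batch (biased variance; gamma=1, beta=0 at init)
manual = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + 1e-3)
print(torch.allclose(out, manual, atol=1e-4))  # True
```

In train mode the built-in layer normalizes with the current batch statistics, exactly as the custom forward pass above does; the running-average bookkeeping differs only in how the EMA is weighted.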
Tracking the Model's Activations
class ActivationTracker(nn.Module):
'''Identity module that keeps track of the current activations during validation.'''
def __init__(self):
super(ActivationTracker, self).__init__()
# Keep track of the [15, 50, 85] percentiles
self.percents_activation_track = [15, 50, 85]
self.all_percents_activation = []
def get_all_activations(self):
return np.array(self.all_percents_activation)
def forward(self, x):
if not self.training :
percents_activation = np.percentile(x.detach().flatten(), self.percents_activation_track)
self.all_percents_activation.append(percents_activation)
#print('percents_activation = ', percents_activation)
return x
Building two networks with three hidden layers; the only difference is whether BN is used
# Network architecture
def init_weights(model):
for module in model:
if type(module) == nn.Linear:
torch.nn.init.normal_(module.weight, mean=0.0, std=1.0) # "Random Gaussian value"
#torch.nn.init.xavier_uniform_(module.weight)
module.bias.data.fill_(0.)
input_size = 784
# Baseline network
baseline_model = nn.Sequential(nn.Linear(input_size,100), #1
nn.Sigmoid(),
nn.Linear(100,100), #2
nn.Sigmoid(),
nn.Linear(100,100), #3
ActivationTracker(),
nn.Sigmoid(),
nn.Linear(100,10) # out
)
init_weights(baseline_model)
# Baseline network with BN
bn_model = nn.Sequential(nn.Linear(input_size,100), #1
myBatchNorm2d(100),
nn.Sigmoid(),
nn.Linear(100,100), #2
myBatchNorm2d(100),
nn.Sigmoid(),
nn.Linear(100,100), #3
myBatchNorm2d(100),
ActivationTracker(),
nn.Sigmoid(),
nn.Linear(100,10) # out
)
init_weights(bn_model)
Setting Up the Training and Validation Loops
# Loss function & metric
criterion = nn.CrossEntropyLoss()
metric = accuracy_score
# Validation loop
def valid_loop(model, valid_loader, criterion, metric, epoch, verbose = True):
sum_loss = 0
sum_score = 0
for it, (imgs, targets) in enumerate(valid_loader, start=1):
imgs = imgs.view(-1,784)
with torch.no_grad():
out = model(imgs) # [batch_size,num_class]
preds = torch.argmax(out.detach(), dim=1) # [batch_size]
loss = criterion(out,targets)
score = metric(targets, preds)
sum_loss += loss
sum_score += score
return sum_score/it, sum_loss/it
## Train the model
def train_loop(model, train_loader, valid_loader, optimizer, scheduler, criterion, metric, verbose = True):
# Validation loss & score lists
valid_stats = []
epochs_valid_stats = []
with tqdm(range(num_epochs), desc = "Train epochs") as epochs_bar :
for e in epochs_bar:
# Training phase
with tqdm(train_loader, leave=False) as it_bar:
for it, (imgs, targets) in enumerate(it_bar, start=1):
imgs = imgs.view(-1,784)
out = model(imgs) # [batch_size,num_class]
preds = torch.argmax(out.detach(), dim=1) # [batch_size]
loss = criterion(out,targets)
score = metric(targets, preds)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if(it % valid_steps == 0):
# Validation phase
model.eval()
valid_score, valid_loss = valid_loop(model, valid_loader, criterion, metric, e, verbose)
valid_stats.append([valid_score.astype(np.float32), \
valid_loss.detach().numpy().astype(np.float32)])
epochs_valid_stats.append(it+e*len(train_loader))
if(verbose):
it_bar.set_postfix(valid_loss=valid_loss.item(), valid_score=valid_score)
model.train()
scheduler.step()
return np.array(valid_stats), epochs_valid_stats
def init_optim_and_scheduler(model, lr = 0.1):
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9) # momentum not mentioned
#optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10) # Not mentioned in the paper
return optimizer, scheduler
#---------------------------
# Training loop
#---------------------------
num_epochs = int(num_iterations/len(train_loader))
# Without BN
print('-'*15, 'BASELINE MODEL', '-'*15)
optimizer, scheduler = init_optim_and_scheduler(baseline_model, lr = LR_BASE)
valid_stats_base, epochs_stats = train_loop(baseline_model, train_loader, test_loader, \
optimizer, scheduler, criterion, metric, verbose = verbose)
# With BN
print('-'*15, 'BATCH NORMALIZED MODEL', '-'*15)
optimizer, scheduler = init_optim_and_scheduler(bn_model, lr = LR_BN)
valid_stats_bn, epochs_stats = train_loop(bn_model, train_loader, test_loader, \
optimizer, scheduler, criterion, metric, verbose = verbose)
Plotting the effect of the BN layer on the metric and the loss
if epochs_stats:
fig, ax = plt.subplots(1, 2, figsize=(10*2, 5))
#ax.clear()
# Scores
ax[0].plot(epochs_stats, valid_stats_base[:, 0], 'k--', \
label = f"Valid (baseline) score {valid_stats_base[-1, 0]:.4f}")
ax[0].plot(epochs_stats, valid_stats_bn[:, 0], 'b-', \
label = f"Valid (with BN) score {valid_stats_bn[-1, 0]:.4f}")
# Losses
ax[1].plot(epochs_stats, valid_stats_base[:, 1], 'k--', \
label = f"Valid (baseline) loss {valid_stats_base[-1, 1]:.4f}")
ax[1].plot(epochs_stats, valid_stats_bn[:, 1], 'b-', \
label = f"Valid (with BN) loss {valid_stats_bn[-1, 1]:.4f}")
ax[0].legend(); ax[1].legend()
plt.show()