$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu_B \right)^2$$
$$\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta$$
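To make the four equations concrete, here is a minimal sketch (my addition, not from the original post) that computes them step by step on a random 2D input and checks the result against PyTorch's built-in `torch.nn.functional.batch_norm` in training mode:

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 4)                        # (batch, features), m = 8
gamma, beta = torch.ones(4), torch.zeros(4)

mu = X.mean(dim=0)                           # mu_B
var = ((X - mu) ** 2).mean(dim=0)            # sigma_B^2 (biased, divides by m)
X_hat = (X - mu) / torch.sqrt(var + 1e-5)    # normalize
Y = gamma * X_hat + beta                     # scale and shift

# Compare with PyTorch's built-in batch norm in training mode
Y_ref = torch.nn.functional.batch_norm(
    X, None, None, weight=gamma, bias=beta, training=True, eps=1e-5)
print(torch.allclose(Y, Y_ref, atol=1e-5))   # expected: True
```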
The parameters to be learned are $\gamma$ and $\beta$. If $X$ has shape $(batch, features)$, then $\gamma$ has shape $(1, features)$; if $X$ has shape $(batch, channel, height, width)$, then $\gamma$ has shape $(1, channel, 1, 1)$ (and likewise for $\beta$), as illustrated below.
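These shapes work because of broadcasting: a quick sketch (my own illustration) showing that a $(1, features)$ or $(1, channel, 1, 1)$ tensor multiplies cleanly against the corresponding input:

```python
import torch

# 2D input: per-feature statistics, gamma broadcasts over the batch dim
X2 = torch.randn(8, 4)              # (batch, features)
gamma2 = torch.ones(1, 4)           # (1, features)

# 4D input: per-channel statistics, gamma broadcasts over batch/height/width
X4 = torch.randn(8, 3, 16, 16)      # (batch, channel, height, width)
gamma4 = torch.ones(1, 3, 1, 1)     # (1, channel, 1, 1)

print((gamma2 * X2).shape)          # torch.Size([8, 4])
print((gamma4 * X4).shape)          # torch.Size([8, 3, 16, 16])
```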
At test time there is no batch to compute statistics over, so we can either keep an exponential moving average (EMA) of the historical $\mu$ and $\sigma^2$, or store the $\mu$ and $\sigma^2$ of every batch and use those sample statistics to estimate the population mean and variance.
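For the second option, here is a minimal sketch under stated assumptions (`batch_means`, `batch_vars`, and the batch size `m` are hypothetical placeholders for statistics saved during training): the per-batch means are averaged to estimate the population mean, and the per-batch biased variances are averaged and rescaled by $m/(m-1)$ as the usual unbiased sample-to-population correction.

```python
import torch

# Hypothetical per-batch statistics collected during training
batch_means = [torch.randn(4) for _ in range(10)]  # saved mu_B values
batch_vars = [torch.rand(4) for _ in range(10)]    # saved sigma_B^2 values (biased)
m = 64                                             # batch size used during training

# Estimate the population mean by averaging the per-batch means
pop_mean = torch.stack(batch_means).mean(dim=0)
# Estimate the population variance with the unbiased m/(m-1) correction
pop_var = torch.stack(batch_vars).mean(dim=0) * m / (m - 1)
```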
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Decide whether we are in training mode or prediction mode
    if not is_training:
        # In prediction mode, use the moving-average mean and variance directly
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute the mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: compute the mean and variance per channel (axis=1).
            # Keep X's shape so that broadcasting works later
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean, moving_var

class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters that receive gradients and get updated,
        # initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape), True)
        self.beta = nn.Parameter(torch.zeros(shape), True)
        # Variables that do not receive gradients, initialized to 0 on the CPU
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)

    def forward(self, X):
        # If moving_mean and moving_var are not on X's device, copy them over
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var. A Module's training attribute
        # defaults to True and is set to False after calling .eval()
        Y, self.moving_mean, self.moving_var = batch_norm(self.training,
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
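A short usage sketch (the toy network below is mine, for illustration only): switching between `.train()` and `.eval()` toggles between normalizing with batch statistics and normalizing with the accumulated moving statistics.

```python
net = nn.Sequential(
    nn.Linear(20, 10),
    BatchNorm(10, num_dims=2),
    nn.ReLU(),
    nn.Linear(10, 2),
)

X = torch.randn(8, 20)
net.train()      # training mode: use batch statistics, update the moving averages
Y_train = net(X)
net.eval()       # prediction mode: use the stored moving statistics
Y_eval = net(X)
print(Y_train.shape, Y_eval.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```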