Assignment Two (3): Batch Normalization
One way to make deep networks easier to train is to use a more sophisticated optimizer such as SGD + momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network so that it is easier to train; one idea along these lines is batch normalization. (Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance.)
At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimates are then used to center and normalize the features of the minibatch. Running averages of these means and standard deviations are maintained during training, and at test time the running averages are used to center and normalize the features.
This normalization strategy could reduce the representational power of the network, since for some layers it may be optimal to have features with non-zero mean or non-unit variance. For this reason, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
Benefits: reduces the impact of bad initialization; speeds up model convergence; allows larger learning rates; helps prevent overfitting.
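Concretely, for a minibatch $\{x_1,\dots,x_m\}$ the layer computes the standard batch normalization transform (these are the equations the code in 1.1 implements):

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
```

The learnable $\gamma$ and $\beta$ restore the network's ability to represent features with any mean and variance it wants.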
1. Batch Normalization
1.1 batchnorm_forward
def batchnorm_forward(x, gamma, beta, bn_param):
    # 1.1.1
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        # 1.1.2
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
        # The network learns its own scale and shift parameters
        out = gamma * x_hat + beta
        cache = (x, gamma, beta, x_hat, sample_mean, sample_var, eps)
        # Keep running statistics for use at test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    elif mode == "test":
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Use the mean and variance accumulated during training
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_hat + beta
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running mean and variance back for test time
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var

    return out, cache
Code analysis 1.1.1:
Note: the batch normalization paper uses the mean and variance computed over a large amount of training data to represent the test-time statistics, rather than the running averages kept during training.
The code here uses the running averages instead, since this requires no extra computation step; the torch7 implementation also uses running averages.
On each iteration, we update the running averages of the mean and variance using an exponential decay controlled by the momentum parameter.
eps is a small constant that keeps the denominator from being zero.
Code analysis 1.1.2:
Compute the mean and variance from the minibatch statistics, use them to normalize the incoming data, then scale and shift the normalized data using gamma and beta.
The output is stored in out; anything needed for the backward pass is stored in cache.
The sample mean and variance are combined with the momentum variable to update the running mean and running variance, stored in running_mean and running_var.
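To see why the momentum update is reasonable, here is a small illustrative sketch (not part of the assignment): the running mean is an exponential moving average of the per-batch sample means, so after enough iterations it converges toward the true feature mean regardless of the zero initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9
running_mean = 0.0  # same zero initialization as in batchnorm_forward

# Batches drawn from a distribution with true mean 3.0
for _ in range(500):
    batch = rng.normal(loc=3.0, scale=1.0, size=64)
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean()

print(running_mean)  # converges to ~3.0
```

A larger momentum gives a smoother but slower-adapting estimate; 0.9 is the usual default.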
Test_1.1.1
Check the training-time forward pass by inspecting the means and standard deviations of the features both before and after batch normalization (simulating a two-layer network).
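The tests below use a helper print_mean_std that is not shown in this excerpt; a minimal sketch (assuming it simply prints per-feature statistics along the given axis) might look like:

```python
import numpy as np

def print_mean_std(x, axis=0):
    # Print the mean and standard deviation of x along the given axis.
    print('  means:', x.mean(axis=axis))
    print('  stds: ', x.std(axis=axis))

print_mean_std(np.array([[1.0, 2.0], [3.0, 4.0]]))
```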
np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)

print('Before batch normalization:')
print_mean_std(a, axis=0)

gamma = np.ones((D3,))
beta = np.zeros((D3,))
# Means should be close to 0 and stds close to 1
print('After batch normalization (gamma=1, beta=0)')
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm, axis=0)

gamma = np.asarray([1.0, 2.0, 3.0])
beta = np.asarray([11.0, 12.0, 13.0])
# Means should be close to beta and stds close to gamma
print('After batch normalization (gamma=', gamma, ', beta=', beta, ')')
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm, axis=0)
Output:
Before batch normalization:
means: [ -2.3814598 -13.18038246 1.91780462]
stds: [27.18502186 34.21455511 37.68611762]
After batch normalization (gamma=1, beta=0)
means: [7.10542736e-17 7.82707232e-17 9.71445147e-18]
stds: [0.99999999 1. 1. ]
After batch normalization (gamma= [1. 2. 3.] , beta= [11. 12. 13.] )
means: [11. 12. 13.]
stds: [0.99999999 1.99999999 2.99999999]
Test_1.1.2
Check the test-time forward pass by running the training-time forward pass many times to warm up the running averages, then checking the means and standard deviations of the activations after a test-time forward pass.
np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
bn_param = {'mode': 'train'}
gamma = np.ones(D3)
beta = np.zeros(D3)
for t in range(50):
X = np.random.randn(N, D1)
a = np.maximum(0, X.dot(W1)).dot(W2)
batchnorm_forward(a, gamma, beta, bn_param)
bn_param['mode'] = 'test'
X = np.random.randn(N, D1)
a = np.maximum(0, X.dot(W1)).dot(W2)
a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)
# Means should be close to zero and stds close to one
print('After batch normalization (test-time):')
print_mean_std(a_norm, axis=0)
Output:
After batch normalization (test-time):
means: [-0.03927354 -0.04349152 -0.10452688]
stds: [1.01531428 1.01238373 0.97819988]
1.2 batchnorm_backward
The backward derivation is fairly involved. Implement the backward pass for batch normalization, storing the results in the dx, dgamma, and dbeta variables.
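Writing $\hat{x}_i = (x_i - \mu)/\sqrt{\sigma^2 + \varepsilon}$ and $y_i = \gamma\hat{x}_i + \beta$, the chain rule gives the standard gradients (this is the derivation from the batch normalization paper, which the code computes term by term):

```latex
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}, \qquad
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\,\gamma

\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}\,(x_i - \mu)\left(-\tfrac{1}{2}\right)(\sigma^2 + \varepsilon)^{-3/2}

\frac{\partial L}{\partial \mu} = -\frac{1}{\sqrt{\sigma^2 + \varepsilon}}\sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}
  \;+\; \frac{\partial L}{\partial \sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N} \bigl(-2(x_i - \mu)\bigr)

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\frac{1}{\sqrt{\sigma^2 + \varepsilon}}
  \;+\; \frac{\partial L}{\partial \sigma^2}\cdot\frac{2(x_i - \mu)}{N}
  \;+\; \frac{\partial L}{\partial \mu}\cdot\frac{1}{N}
```

The dvar line in the code is exactly $\partial L/\partial \sigma^2$: sqrt_var holds $(\sigma^2+\varepsilon)^{-1/2}$, so sqrt_var**3 is the $(\sigma^2+\varepsilon)^{-3/2}$ factor.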
def batchnorm_backward(dout, cache):
    # 1.2.1
    dx, dgamma, dbeta = None, None, None
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    x, gamma, beta, x_norm, mean, var, eps = cache
    N = x.shape[0]
    dx_norm = dout * gamma
    sqrt_var = 1 / np.sqrt(var + eps)
    # var has shape (D,), so sum over the batch axis
    dvar = np.sum((x - mean) * dx_norm * (-0.5) * sqrt_var**3, axis=0)