Table of Contents
Batch Normalization and Batch Size
Layer Normalization and Batch Size
Batch Norm for Convolutional Layers: Spatial BatchNorm
The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, they propose to insert into the network layers that normalize batches. At training time, such a layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.
It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
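As a concrete illustration of the mechanism described above, here is a minimal NumPy sketch (toy data and made-up hyperparameters, not the assignment's implementation) of training-time normalization, the running-average update, and test-time normalization:

import numpy as np

N, D = 4, 3
xb = 5.0 * np.random.randn(N, D) + 2.0       # toy minibatch
gamma, beta = np.ones(D), np.zeros(D)        # learnable scale and shift
running_mean, running_var = np.zeros(D), np.zeros(D)
momentum, eps = 0.9, 1e-5

# Training time: center and scale with minibatch statistics.
mu, var = xb.mean(axis=0), xb.var(axis=0)
out_train = gamma * (xb - mu) / np.sqrt(var + eps) + beta

# Keep exponentially decaying running averages for test time.
running_mean = momentum * running_mean + (1 - momentum) * mu
running_var = momentum * running_var + (1 - momentum) * var

# Test time: use the running averages instead of minibatch statistics.
out_test = gamma * (xb - running_mean) / np.sqrt(running_var + eps) + beta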
Imports and Data Preparation
# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0) # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"
%load_ext autoreload
%autoreload 2
def rel_error(x, y):
    """Returns relative error."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def print_mean_std(x, axis=0):
    print(f"  means: {x.mean(axis=axis)}")
    print(f"  stds:  {x.std(axis=axis)}\n")

# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
    print(f"{k}: {v.shape}")
BatchNorm
forward
Batch normalization layers are usually inserted after fully connected or convolutional layers and before nonlinearities; the outputs from the last layer of the network should not be normalized.
def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        x_mean = np.mean(x, axis=0)
        x_var = np.var(x, axis=0) + eps  # eps is folded into the variance; the backward pass relies on this
        x_norm = (x - x_mean) / np.sqrt(x_var)
        out = x_norm * gamma + beta  # (N, D)
        cache = (x, x_norm, gamma, x_mean, x_var)

        # Store the updated running means back into bn_param.
        bn_param["running_mean"] = momentum * running_mean + (1 - momentum) * x_mean
        bn_param["running_var"] = momentum * running_var + (1 - momentum) * x_var
    elif mode == "test":
        x_norm = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_norm + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    return out, cache
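A quick sanity check of the forward pass (a sketch with made-up toy data, in the spirit of the assignment notebook) is to feed features through batch normalization and verify that the per-feature means come out close to beta and the standard deviations close to gamma:

# Toy check: after batchnorm, means should be ~beta and stds ~gamma.
np.random.seed(231)
N, D1, D2 = 200, 50, 3
X = np.random.randn(N, D1)
W = np.random.randn(D1, D2)
a = np.maximum(0, X.dot(W))                  # features from a made-up ReLU layer

gamma = np.asarray([1.0, 2.0, 3.0])
beta = np.asarray([11.0, 12.0, 13.0])
a_norm, _ = batchnorm_forward(a, gamma, beta, {"mode": "train"})
print_mean_std(a_norm, axis=0)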
backward
For reference:
【cs231n】Batchnorm及其反向传播_JoeYF_的博客-CSDN博客_batchnorm反向传播
Batch Normalization的反向传播解说_macan_dct的博客-CSDN博客_batchnorm反向传播
Understanding the backward pass through Batch Normalization Layer
理解Batch Normalization(批量归一化)_坚硬果壳_的博客-CSDN博客
The gradient is obtained by walking backward through the computational graph of the forward pass, one node at a time (the approach of the "Understanding the backward pass through Batch Normalization Layer" post linked above). Using the variable names from the code below, the steps are:

Step 9: out = gamma * x_norm + beta, so dbeta = sum(dout, axis=0) and the gradient flowing into the product gamma * x_norm is dout itself.
Step 8: dgamma = sum(dout * x_norm, axis=0) and dx_norm = dout * gamma.
Step 7: x_norm = x_mu * ivar, so dxmu1 = dx_norm * ivar and divar = sum(dx_norm * x_mu, axis=0).
Step 6: ivar = 1 / std, so dsqrtvar = -divar / std**2.
Step 5: std = sqrt(var), so dvar = 0.5 / std * dsqrtvar.
Step 4: var is a mean of squared deviations over the batch, so dsquare = (1/N) * ones((N, D)) * dvar.
Step 3: square = x_mu**2, so dxmu2 = 2 * x_mu * dsquare.
Step 2: the two branches into x_mu combine: dx1 = dxmu1 + dxmu2; since x_mu = x - mean, dmu = -sum(dx1, axis=0).
Step 1: mean = (1/N) * sum(x, axis=0), so dx2 = (1/N) * ones((N, D)) * dmu.
Step 0: dx = dx1 + dx2.
Step-by-step version
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization, written out as a computation graph.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    x, x_norm, gamma, mean, var = cache
    N, D = dout.shape

    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(dout * x_norm, axis=0)  # (N, D) * (N, D)
    # Note that dgamma is not computed with a matrix product as in the FC-layer
    # backward pass, because the forward pass multiplies elementwise with
    # broadcasting, (N, D) * (D,), rather than doing a true matrix multiply.

    dx_norm = dout * gamma  # (N, D) * (D,); x_norm is the normalized x
    x_mu = x - mean
    std = np.sqrt(var)  # var already includes eps (added in the forward pass)
    ivar = 1 / std  # (D,)

    # x_mu appears in both the numerator and the denominator of x_norm, so
    # first compute the gradient w.r.t. x_mu, treating 1/std as a separate factor.
    dxmu1 = dx_norm * ivar  # (N, D) * (D,)
    divar = np.sum(dx_norm * x_mu, axis=0)  # gradient w.r.t. ivar, shape (D,)
    dsqrtvar = -1 / (std ** 2) * divar  # derivative of the reciprocal w.r.t. the sqrt term (std)
    dvar = 0.5 / std * dsqrtvar  # derivative of the sqrt term w.r.t. var
    dsquare = 1 / N * np.ones((N, D)) * dvar  # var was a column-wise mean over an (N, D) matrix
    dxmu2 = 2 * x_mu * dsquare  # derivative of the squared term w.r.t. x_mu
    dx1 = dxmu1 + dxmu2

    # Now backprop x_mu into x. The mean mu is itself a function of x,
    # so the gradient is a sum: first w.r.t. mu, then mu w.r.t. x.
    dmu = -1 * np.sum(dx1, axis=0)  # (D,); backprop through the broadcast
    dx2 = 1 / N * np.ones((N, D)) * dmu  # mu was a mean over the batch
    dx = dx1 + dx2

    return dx, dgamma, dbeta
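The implementation can be verified against numeric gradients using the eval_numerical_gradient_array helper imported in the setup cell; a sketch with made-up toy inputs:

# Numeric gradient check on toy data.
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)
bn_param = {"mode": "train"}

fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
fg = lambda g: batchnorm_forward(x, g, beta, bn_param)[0]
fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
dgamma_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
dbeta_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
print("dx error:     ", rel_error(dx_num, dx))
print("dgamma error: ", rel_error(dgamma_num, dgamma))
print("dbeta error:  ", rel_error(dbeta_num, dbeta))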
Fast version
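Before looking at the code, note that the per-step gradients above collapse into a compact closed form. A minimal sketch of that form, assuming the same cache layout as above (in particular, that var already contains eps); it is equivalent to the function below, which groups the terms differently:

def batchnorm_backward_compact(dout, cache):
    # Fully collapsed closed form of the batchnorm backward pass.
    x, x_norm, gamma, mean, var = cache
    std = np.sqrt(var)                           # var already includes eps
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx = gamma / std * (dout - dout.mean(axis=0) - x_norm * (dout * x_norm).mean(axis=0))
    return dx, dgamma, dbeta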
def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization that works out a
    simpler closed-form expression instead of stepping through the graph.
    """
    dx, dgamma, dbeta = None, None, None
    x, x_norm, gamma, mean, var = cache
    std = np.sqrt(var)

    dgamma = np.sum(dout * x_norm, axis=0)  # (N, D) * (N, D)
    dbeta = np.sum(dout, axis=0)

    N = 1.0 * x.shape[0]
    dfdu = dout * gamma
    # Split x_norm's dependence on x into three paths (x directly, mu, var),
    # differentiate each path w.r.t. x, then sum the contributions.
    dfdv = np.sum(dfdu * (x - mean) * -0.5 * var ** -1.5, axis=0)  # df/dvar
    dfdw = np.sum(dfdu * -1 / std, axis=0) + \
           dfdv * np.sum(-2 / N * (x - mean), axis=0)  # df/dmu, including var's dependence on mu
    dx = dfdu / std + dfdv * 2 / N * (x - mean) + dfdw / N  # sum the three paths into dx
    '''
    # An alternative computation (from another source; I don't fully follow it):
    N = dout.shape[0]
    dfdz = dout * gamma  # z denotes x_norm below, shape (N, D)
dfdz_sum = np.sum(dfdz,axis=