Batch Normalization (BN) was born to overcome the difficulty of training ever-deeper neural networks: as depth increases, training becomes harder, convergence slows down, and the vanishing gradient problem frequently arises.
Statistical machine learning rests on a classic assumption: the data distributions of the source domain and the target domain are identical. In other words, the training data and the test data follow the same distribution. This is the basic guarantee that a model fitted on the training data will perform well on the test set.
Covariate shift refers to the situation where the training samples and the target samples are not identically distributed, so the trained model fails to generalize well. It is a sub-problem under the distribution-consistency assumption: the conditional distributions of the source domain and the target domain agree, but their marginal distributions differ. Indeed, for the outputs of each layer of a neural network, after the in-layer transformation the output distribution differs from the corresponding input distribution, and the difference grows with network depth, even though the label each layer ultimately points to remains unchanged.
Remedy: in general one corrects the training samples by reweighting them according to the ratio between the training and target distributions. Accordingly, Batch Normalization is introduced to standardize the inputs of some or all layers, thereby fixing the mean and variance of each layer's input signal.
Method: Batch Normalization is typically applied before the nonlinear mapping (the activation function). It standardizes x = Wu + b so that each dimension of the result (the output signal) has mean 0 and variance 1. Giving every layer an input with a stable distribution benefits training.
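To make the standardization concrete, here is a minimal pure-Python sketch (not the TensorFlow implementation) of what happens to one feature dimension of a mini-batch: subtract the batch mean, divide by the batch standard deviation (with a small epsilon for stability), then apply the learned scale gamma and shift beta. All names and values are illustrative.

```python
import math

def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature dimension of a mini-batch, then scale/shift."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

batch = [1.0, 2.0, 3.0, 4.0]
normed = batch_norm_1d(batch)
# After normalization the batch has (approximately) mean 0 and variance 1.
```

With gamma=1 and beta=0 this is pure standardization; the learnable gamma and beta let the network restore representational power when mean 0 / variance 1 is not the best operating point.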
Advantages: by standardizing the inputs, Batch Normalization keeps activations in the near-linear region of the activation function, which enlarges gradients and lets the model take bolder gradient-descent steps. Concretely, it:
- allows larger search step sizes, speeding up convergence;
- makes it easier to escape local minima;
- perturbs the original data distribution, which alleviates overfitting to some extent.
Therefore, whenever a network is effectively untrainable, whether because convergence is very slow or because of exploding gradients (Gradient Explosion), Batch Normalization is worth trying.
Gradient explosion: when individual gradients are very large, their product under the chain rule becomes enormous, driving the weights to extremely large values and producing exponential blow-up.
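A toy numerical illustration of this chain-rule blow-up (the numbers are purely illustrative, not from any real network): if each of 50 layers contributes a local derivative slightly above 1, the end-to-end gradient factor explodes; slightly below 1, it vanishes instead.

```python
depth = 50
explode = 1.5 ** depth  # per-layer derivative a bit above 1 -> exponential growth
vanish = 0.5 ** depth   # per-layer derivative a bit below 1 -> exponential decay
```

This symmetry is why both exploding and vanishing gradients are depth problems, and why stabilizing per-layer statistics helps with both.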
In GANs we make full use of batch normalization to speed up model convergence.
The batch normalization source code is located at:
C:\Anaconda3\envs\tensorflow\Lib\site-packages\tensorflow\python\ops\nn_impl.py
Invocation: tf.nn.batch_normalization()
```python
def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):
  r"""Batch normalization.

  As described in http://arxiv.org/abs/1502.03167.
  Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
  `scale` \\(\gamma\\) to it, as well as an `offset` \\(\beta\\):

  \\(\frac{\gamma(x-\mu)}{\sigma}+\beta\\)

  `mean`, `variance`, `offset` and `scale` are all expected to be of one of two
  shapes:

  * In all generality, they can have the same number of dimensions as the
    input `x`, with identical sizes as `x` for the dimensions that are not
    normalized over (the 'depth' dimension(s)), and dimension 1 for the
    others which are being normalized over.
    `mean` and `variance` in this case would typically be the outputs of
    `tf.nn.moments(..., keep_dims=True)` during training, or running averages
    thereof during inference.
  * In the common case where the 'depth' dimension is the last dimension in
    the input tensor `x`, they may be one dimensional tensors of the same
    size as the 'depth' dimension.
    This is the case for example for the common `[batch, depth]` layout of
    fully-connected layers, and `[batch, height, width, depth]` for
    convolutions.
    `mean` and `variance` in this case would typically be the outputs of
    `tf.nn.moments(..., keep_dims=False)` during training, or running averages
    thereof during inference.

  Args:
    x: Input `Tensor` of arbitrary dimensionality.
    mean: A mean `Tensor`.
    variance: A variance `Tensor`.
    offset: An offset `Tensor`, often denoted \\(\beta\\) in equations, or
      None. If present, will be added to the normalized tensor.
    scale: A scale `Tensor`, often denoted \\(\gamma\\) in equations, or
      `None`. If present, the scale is applied to the normalized tensor.
    variance_epsilon: A small float number to avoid dividing by 0.
    name: A name for this operation (optional).

  Returns:
    the normalized, scaled, offset tensor.
  """
  with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
    inv = math_ops.rsqrt(variance + variance_epsilon)
    if scale is not None:
      inv *= scale
    return x * inv + (offset - mean * inv
                      if offset is not None else -mean * inv)
```
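Note how the body rearranges the textbook formula gamma * (x - mean) / sqrt(variance + eps) + beta into x * inv + (offset - mean * inv) with inv = gamma * rsqrt(variance + eps): everything except x is precomputed once, leaving a single multiply-add per element. A quick pure-Python check with made-up scalar values (not TensorFlow code) confirms the two forms agree:

```python
import math

x, mean, var, gamma, beta, eps = 3.0, 1.0, 4.0, 2.0, 0.5, 0.001

# Textbook form: gamma * (x - mean) / sqrt(var + eps) + beta
textbook = gamma * (x - mean) / math.sqrt(var + eps) + beta

# TensorFlow's rearranged form: precompute inv, then one multiply-add.
inv = gamma / math.sqrt(var + eps)  # stands in for gamma * rsqrt(var + eps)
rearranged = x * inv + (beta - mean * inv)
```

The two expressions are algebraically identical; the rearrangement just minimizes per-element work on large tensors.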
C:\Anaconda3\envs\tensorflow\Lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py
Invocation: tf.contrib.layers.batch_norm
```python
def batch_norm(inputs,
               decay=0.999,
               center=True,
               scale=False,
               epsilon=0.001,
               activation_fn=None,
               param_initializers=None,
               param_regularizers=None,
               updates_collections=ops.GraphKeys.UPDATE_OPS,
               is_training=True,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               batch_weights=None,
               fused=False,
               data_format=DATA_FORMAT_NHWC,
               zero_debias_moving_mean=False,
               scope=None,
               renorm=False,
               renorm_clipping=None,
               renorm_decay=0.99):
  """Adds a Batch Normalization layer from http://arxiv.org/abs/1502.03167.

    "Batch Normalization: Accelerating Deep Network Training by Reducing
    Internal Covariate Shift"
    Sergey Ioffe, Christian Szegedy

  Can be used as a normalizer function for conv2d and fully_connected.

  Note: When is_training is True the moving_mean and moving_variance need to be
  updated, by default the update_ops are placed in `tf.GraphKeys.UPDATE_OPS` so
  they need to be added as a dependency to the `train_op`, example:

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
      train_op = optimizer.minimize(loss)

  One can set updates_collections=None to force the updates in place, but that
  can have speed penalty, especially in distributed settings.

  Args:
    inputs: A tensor with 2 or more dimensions, where the first dimension has
      `batch_size`. The normalization is over all but the last dimension if
      `data_format` is `NHWC` and the second dimension if `data_format` is
      `NCHW`.
    decay: Decay for the moving average. Reasonable values for `decay` are close
      to 1.0, typically in the multiple-nines range: 0.999, 0.99, 0.9, etc.
      Lower `decay` value (recommend trying `decay`=0.9) if model experiences
      reasonably good training performance but poor validation and/or test
      performance. Try zero_debias_moving_mean=True for improved stability.
    center: If True, add offset of `beta` to normalized tensor. If False, `beta`
      is ignored.
    scale: If True, multiply by `gamma`. If False, `gamma` is
      not used. When the next layer is linear (also e.g. `nn.relu`), this can be
      disabled since the scaling can be done by the next layer.
    epsilon: Small float added to variance to avoid dividing by zero.
    activation_fn: Activation function, default set to None to skip it and
      maintain a linear activation.
    param_initializers: Optional initializers for beta, gamma, moving mean and
      moving variance.
    param_regularizers: Optional regularizer for beta and gamma.
    updates_collections: Collections to collect the update ops for computation.
      The updates_ops need to be executed with the train_op.
      If None, a control dependency would be added to make sure the updates are
      computed in place.
    is_training: Whether or not the layer is in training mode. In training mode
      it would accumulate the statistics of the moments into `moving_mean` and
      `moving_variance` using an exponential moving average with the given
      `decay`. When it is not in training mode then it would use the values of
      the `moving_mean` and the `moving_variance`.
    reuse: Whether or not the layer and its variables should be reused. To be
      able to reuse the layer scope must be given.
    variables_collections: Optional collections for the variables.
    outputs_collections: Collections to add the outputs.
    trainable: If `True` also add variables to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
    batch_weights: An optional tensor of shape `[batch_size]`,
      containing a frequency weight for each batch item. If present,
      then the batch normalization uses weighted mean and
      variance. (This can be used to correct for bias in training
      example selection.)
    fused: Use nn.fused_batch_norm if True, nn.batch_normalization otherwise.
    data_format: A string. `NHWC` (default) and `NCHW` are supported.
    zero_debias_moving_mean: Use zero_debias for moving_mean. It creates a new
      pair of variables 'moving_mean/biased' and 'moving_mean/local_step'.
    scope: Optional scope for `variable_scope`.
    renorm: Whether to use Batch Renormalization
      (https://arxiv.org/abs/1702.03275). This adds extra variables during
      training. The inference is the same for either value of this parameter.
    renorm_clipping: A dictionary that may map keys 'rmax', 'rmin', 'dmax' to
      scalar `Tensors` used to clip the renorm correction. The correction
      `(r, d)` is used as `corrected_value = normalized_value * r + d`, with
      `r` clipped to [rmin, rmax], and `d` to [-dmax, dmax]. Missing rmax, rmin,
      dmax are set to inf, 0, inf, respectively.
    renorm_decay: Momentum used to update the moving means and standard
      deviations with renorm. Unlike `momentum`, this affects training
      and should be neither too small (which would add noise) nor too large
      (which would give stale estimates). Note that `decay` is still applied
      to get the means and variances for inference.

  Returns:
    A `Tensor` representing the output of the operation.

  Raises:
    ValueError: If `batch_weights` is not None and `fused` is True.
    ValueError: If `param_regularizers` is not None and `fused` is True.
    ValueError: If `data_format` is neither `NHWC` nor `NCHW`.
    ValueError: If the rank of `inputs` is undefined.
    ValueError: If rank or channels dimension of `inputs` is undefined.
  """
  if fused:
    if batch_weights is not None:
      raise ValueError('Weighted mean and variance is not currently '
                       'supported for fused batch norm.')
    if param_regularizers is not None:
      raise ValueError('Regularizers are not currently '
                       'supported for fused batch norm.')
    if renorm:
      raise ValueError('Renorm is not supported for fused batch norm.')
    return _fused_batch_norm(
        inputs,
        decay=decay,
        center=center,
        scale=scale,
        epsilon=epsilon,
        activation_fn=activation_fn,
        param_initializers=param_initializers,
        updates_collections=updates_collections,
        is_training=is_training,
        reuse=reuse,
        variables_collections=variables_collections,
        outputs_collections=outputs_collections,
        trainable=trainable,
        data_format=data_format,
        zero_debias_moving_mean=zero_debias_moving_mean,
        scope=scope)

  if data_format not in (DATA_FORMAT_NCHW, DATA_FORMAT_NHWC):
    raise ValueError('data_format has to be either NCHW or NHWC.')

  layer_variable_getter = _build_variable_getter()
  with variable_scope.variable_scope(
      scope, 'BatchNorm', [inputs], reuse=reuse,
      custom_getter=layer_variable_getter) as sc:
    inputs = ops.convert_to_tensor(inputs)

    # Determine whether we can use the core layer class.
    if (batch_weights is None and
        updates_collections is ops.GraphKeys.UPDATE_OPS and
        not zero_debias_moving_mean):
      # Use the core layer class.
      axis = 1 if data_format == DATA_FORMAT_NCHW else -1
      if not param_initializers:
        param_initializers = {}
      beta_initializer = param_initializers.get('beta',
                                                init_ops.zeros_initializer())
      gamma_initializer = param_initializers.get('gamma',
                                                 init_ops.ones_initializer())
      moving_mean_initializer = param_initializers.get(
          'moving_mean', init_ops.zeros_initializer())
      moving_variance_initializer = param_initializers.get(
          'moving_variance', init_ops.ones_initializer())
      if not param_regularizers:
        param_regularizers = {}
      beta_regularizer = param_regularizers.get('beta')
      gamma_regularizer = param_regularizers.get('gamma')
      layer = normalization_layers.BatchNormalization(
          axis=axis,
          momentum=decay,
          epsilon=epsilon,
          center=center,
          scale=scale,
          beta_initializer=beta_initializer,
          gamma_initializer=gamma_initializer,
          moving_mean_initializer=moving_mean_initializer,
          moving_variance_initializer=moving_variance_initializer,
          beta_regularizer=beta_regularizer,
          gamma_regularizer=gamma_regularizer,
          trainable=trainable,
          renorm=renorm,
          renorm_clipping=renorm_clipping,
          renorm_momentum=renorm_decay,
          name=sc.name,
          _scope=sc,
          _reuse=reuse)
      outputs = layer.apply(inputs, training=is_training)

      # Add variables to collections.
      _add_variable_to_collections(
          layer.moving_mean, variables_collections, 'moving_mean')
      _add_variable_to_collections(
          layer.moving_variance, variables_collections, 'moving_variance')
      if layer.beta:
        _add_variable_to_collections(layer.beta, variables_collections, 'beta')
      if layer.gamma:
        _add_variable_to_collections(
            layer.gamma, variables_collections, 'gamma')

      if activation_fn is not None:
        outputs = activation_fn(outputs)
      return utils.collect_named_outputs(outputs_collections,
                                         sc.original_name_scope, outputs)

    # Not supported by layer class: batch_weights argument,
    # and custom updates_collections. In that case, use the legacy BN
    # implementation.
    # Custom updates collections are not supported because the update logic
    # is different in this case, in particular w.r.t. "forced updates" and
    # update op reuse.
    if renorm:
      raise ValueError('renorm is not supported with batch_weights, '
                       'updates_collections or zero_debias_moving_mean')
    inputs_shape = inputs.get_shape()
    inputs_rank = inputs_shape.ndims
    if inputs_rank is None:
      raise ValueError('Inputs %s has undefined rank.' % inputs.name)
    dtype = inputs.dtype.base_dtype
    if batch_weights is not None:
      batch_weights = ops.convert_to_tensor(batch_weights)
      inputs_shape[0:1].assert_is_compatible_with(batch_weights.get_shape())
      # Reshape batch weight values so they broadcast across inputs.
      nshape = [-1] + [1 for _ in range(inputs_rank - 1)]
      batch_weights = array_ops.reshape(batch_weights, nshape)

    if data_format == DATA_FORMAT_NCHW:
      moments_axes = [0] + list(range(2, inputs_rank))
      params_shape = inputs_shape[1:2]
      # For NCHW format, rather than relying on implicit broadcasting, we
      # explicitly reshape the params to params_shape_broadcast when computing
      # the moments and the batch normalization.
      params_shape_broadcast = list(
          [1, inputs_shape[1].value] + [1 for _ in range(2, inputs_rank)])
    else:
      moments_axes = list(range(inputs_rank - 1))
      params_shape = inputs_shape[-1:]
      params_shape_broadcast = None
    if not params_shape.is_fully_defined():
      raise ValueError('Inputs %s has undefined channels dimension %s.' % (
          inputs.name, params_shape))

    # Allocate parameters for the beta and gamma of the normalization.
    beta, gamma = None, None
    if not param_initializers:
      param_initializers = {}
    if center:
      beta_collections = utils.get_variable_collections(variables_collections,
                                                        'beta')
      beta_initializer = param_initializers.get('beta',
                                                init_ops.zeros_initializer())
      beta = variables.model_variable('beta',
                                      shape=params_shape,
                                      dtype=dtype,
                                      initializer=beta_initializer,
                                      collections=beta_collections,
                                      trainable=trainable)
    if scale:
      gamma_collections = utils.get_variable_collections(variables_collections,
                                                         'gamma')
      gamma_initializer = param_initializers.get('gamma',
                                                 init_ops.ones_initializer())
      gamma = variables.model_variable('gamma',
                                       shape=params_shape,
                                       dtype=dtype,
                                       initializer=gamma_initializer,
                                       collections=gamma_collections,
                                       trainable=trainable)

    # Create moving_mean and moving_variance variables and add them to the
    # appropriate collections. We disable variable partitioning while creating
    # them, because assign_moving_average is not yet supported for partitioned
    # variables.
    partitioner = variable_scope.get_variable_scope().partitioner
    try:
      variable_scope.get_variable_scope().set_partitioner(None)
      moving_mean_collections = utils.get_variable_collections(
          variables_collections, 'moving_mean')
      moving_mean_initializer = param_initializers.get(
          'moving_mean', init_ops.zeros_initializer())
      moving_mean = variables.model_variable(
          'moving_mean',
          shape=params_shape,
          dtype=dtype,
          initializer=moving_mean_initializer,
          trainable=False,
          collections=moving_mean_collections)
      moving_variance_collections = utils.get_variable_collections(
          variables_collections, 'moving_variance')
      moving_variance_initializer = param_initializers.get(
          'moving_variance', init_ops.ones_initializer())
      moving_variance = variables.model_variable(
          'moving_variance',
          shape=params_shape,
          dtype=dtype,
          initializer=moving_variance_initializer,
          trainable=False,
          collections=moving_variance_collections)
    finally:
      variable_scope.get_variable_scope().set_partitioner(partitioner)

    # If `is_training` doesn't have a constant value, because it is a `Tensor`,
    # a `Variable` or `Placeholder` then is_training_value will be None and
    # `needs_moments` will be true.
    is_training_value = utils.constant_value(is_training)
    need_moments = is_training_value is None or is_training_value
    if need_moments:
      # Calculate the moments based on the individual batch.
      if batch_weights is None:
        if data_format == DATA_FORMAT_NCHW:
          mean, variance = nn.moments(inputs, moments_axes, keep_dims=True)
          mean = array_ops.reshape(mean, [-1])
          variance = array_ops.reshape(variance, [-1])
        else:
          mean, variance = nn.moments(inputs, moments_axes)
      else:
        if data_format == DATA_FORMAT_NCHW:
          mean, variance = nn.weighted_moments(inputs, moments_axes,
                                               batch_weights, keep_dims=True)
          mean = array_ops.reshape(mean, [-1])
          variance = array_ops.reshape(variance, [-1])
        else:
          mean, variance = nn.weighted_moments(inputs, moments_axes,
                                               batch_weights)

      moving_vars_fn = lambda: (moving_mean, moving_variance)
      if updates_collections is None:
        def _force_updates():
          """Internal function forces updates moving_vars if is_training."""
          update_moving_mean = moving_averages.assign_moving_average(
              moving_mean, mean, decay, zero_debias=zero_debias_moving_mean)
          update_moving_variance = moving_averages.assign_moving_average(
              moving_variance, variance, decay, zero_debias=False)
          with ops.control_dependencies([update_moving_mean,
                                         update_moving_variance]):
            return array_ops.identity(mean), array_ops.identity(variance)
        mean, variance = utils.smart_cond(is_training,
                                          _force_updates,
                                          moving_vars_fn)
      else:
        def _delay_updates():
          """Internal function that delay updates moving_vars if is_training."""
          update_moving_mean = moving_averages.assign_moving_average(
              moving_mean, mean, decay, zero_debias=zero_debias_moving_mean)
          update_moving_variance = moving_averages.assign_moving_average(
              moving_variance, variance, decay, zero_debias=False)
          return update_moving_mean, update_moving_variance

        update_mean, update_variance = utils.smart_cond(is_training,
                                                        _delay_updates,
                                                        moving_vars_fn)
        ops.add_to_collections(updates_collections, update_mean)
        ops.add_to_collections(updates_collections, update_variance)
        # Use computed moments during training and moving_vars otherwise.
        vars_fn = lambda: (mean, variance)
        mean, variance = utils.smart_cond(is_training, vars_fn, moving_vars_fn)
    else:
      mean, variance = moving_mean, moving_variance
    if data_format == DATA_FORMAT_NCHW:
      mean = array_ops.reshape(mean, params_shape_broadcast)
      variance = array_ops.reshape(variance, params_shape_broadcast)
      beta = array_ops.reshape(beta, params_shape_broadcast)
      if gamma is not None:
        gamma = array_ops.reshape(gamma, params_shape_broadcast)

    # Compute batch_normalization.
    outputs = nn.batch_normalization(inputs, mean, variance, beta, gamma,
                                     epsilon)
    outputs.set_shape(inputs_shape)
    if activation_fn is not None:
      outputs = activation_fn(outputs)
    return utils.collect_named_outputs(outputs_collections,
                                       sc.original_name_scope, outputs)
```
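The moving_mean and moving_variance used at inference time are maintained by assign_moving_average, which applies an exponential moving average: variable <- variable * decay + value * (1 - decay). A minimal pure-Python sketch of that update rule, with illustrative values (the real TensorFlow op also supports zero-debiasing, which this sketch omits):

```python
def assign_moving_average_sketch(moving, value, decay):
    """One EMA step: moving <- moving * decay + value * (1 - decay)."""
    return moving * decay + value * (1.0 - decay)

# Feed in four batches whose per-batch mean is 10.0; the moving mean drifts
# toward 10.0 at a rate controlled by decay.
moving_mean = 0.0
for batch_mean in [10.0, 10.0, 10.0, 10.0]:
    moving_mean = assign_moving_average_sketch(moving_mean, batch_mean,
                                               decay=0.9)
```

The slow drift from the zero initialization is exactly the bias that the zero_debias_moving_mean option is meant to correct, which is why the docstring suggests it when a high decay causes training/validation mismatch.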