Batch Normalization (BN) was born to overcome the difficulty of training ever-deeper neural networks: as depth increases, training becomes harder, convergence slows down, and the vanishing gradient problem frequently arises.
Statistical machine learning rests on a classic assumption: the data distributions of the source domain and the target domain are identical. In other words, the training data and the test data follow the same distribution. This is the basic guarantee that a model fitted on the training data will perform well on the test set.
Covariate shift refers to the situation where the training samples and the target samples are not identically distributed, so the trained model fails to generalize well. It is a sub-problem under the distribution-consistency assumption: the conditional distributions of the source domain and the target domain agree, but their marginal distributions differ. Indeed, for the outputs of each layer of a neural network, after the in-layer transformation the output distribution differs from the corresponding input distribution, and the difference grows with network depth, even though the label each layer ultimately points to remains unchanged.
Remedy: in general one corrects the training samples by reweighting them according to the ratio between the training and target distributions. Accordingly, Batch Normalization is introduced to standardize the inputs of some or all layers, thereby fixing the mean and variance of each layer's input signal.
Method: Batch Normalization is typically applied before the nonlinear mapping (the activation function). It standardizes x = Wu + b so that each dimension of the result (the output signal) has mean 0 and variance 1. Giving every layer an input with a stable distribution benefits training.
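To make the standardization concrete, here is a minimal pure-Python sketch (not the TensorFlow implementation) of what happens to one feature dimension of a mini-batch: subtract the batch mean, divide by the batch standard deviation (with a small epsilon for stability), then apply the learned scale gamma and shift beta. All names and values are illustrative.

```python
import math

def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature dimension of a mini-batch, then scale/shift."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

batch = [1.0, 2.0, 3.0, 4.0]
normed = batch_norm_1d(batch)
# After normalization the batch has (approximately) mean 0 and variance 1.
```

With gamma=1 and beta=0 this is pure standardization; the learnable gamma and beta let the network restore representational power when mean 0 / variance 1 is not the best operating point.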
Advantages: by standardizing the inputs, Batch Normalization keeps activations in the near-linear region of the activation function, which enlarges gradients and lets the model take bolder gradient-descent steps. Concretely, it:
- allows larger search step sizes, speeding up convergence;
- makes it easier to escape local minima;
- perturbs the original data distribution, which alleviates overfitting to some extent.
Therefore, whenever a network is effectively untrainable, whether because convergence is very slow or because of exploding gradients (Gradient Explosion), Batch Normalization is worth trying.
Gradient explosion: when individual gradients are very large, their product under the chain rule becomes enormous, driving the weights to extremely large values and producing exponential blow-up.
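A toy numerical illustration of this chain-rule blow-up (the numbers are purely illustrative, not from any real network): if each of 50 layers contributes a local derivative slightly above 1, the end-to-end gradient factor explodes; slightly below 1, it vanishes instead.

```python
depth = 50
explode = 1.5 ** depth  # per-layer derivative a bit above 1 -> exponential growth
vanish = 0.5 ** depth   # per-layer derivative a bit below 1 -> exponential decay
```

This symmetry is why both exploding and vanishing gradients are depth problems, and why stabilizing per-layer statistics helps with both.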
In GANs we make full use of batch normalization to speed up model convergence.
The batch normalization source code is located at:
C:\Anaconda3\envs\tensorflow\Lib\site-packages\tensorflow\python\ops\nn_impl.py
Invocation: tf.nn.batch_normalization()
```python
def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):
  r"""Batch normalization.

  As described in http://arxiv.org/abs/1502.03167.
  Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
  `scale` \\(\gamma\\) to it, as well as an `offset` \\(\beta\\):

  \\(\frac{\gamma(x-\mu)}{\sigma}+\beta\\)

  `mean`, `variance`, `offset` and `scale` are all expected to be of one of two
  shapes:

  * In all generality, they can have the same number of dimensions as the
    input `x`, with identical sizes as `x` for the dimensions that are not
    normalized over (the 'depth' dimension(s)), and dimension 1 for the
    others which are being normalized over.
    `mean` and `variance` in this case would typically be the outputs of
    `tf.nn.moments(..., keep_dims=True)` during training, or running averages
    thereof during inference.
  * In the common case where the 'depth' dimension is the last dimension in
    the input tensor `x`, they may be one dimensional tensors of the same
    size as the 'depth' dimension.
    This is the case for example for the common `[batch, depth]` layout of
    fully-connected layers, and `[batch, height, width, depth]` for
    convolutions.
    `mean` and `variance` in this case would typically be the outputs of
    `tf.nn.moments(..., keep_dims=False)` during training, or running averages
    thereof during inference.

  Args:
    x: Input `Tensor` of arbitrary dimensionality.
    mean: A mean `Tensor`.
    variance: A variance `Tensor`.
    offset: An offset `Tensor`, often denoted \\(\beta\\) in equations, or
      None. If present, will be added to the normalized tensor.
    scale: A scale `Tensor`, often denoted \\(\gamma\\) in equations, or
      `None`. If present, the scale is applied to the normalized tensor.
    variance_epsilon: A small float number to avoid dividing by 0.
    name: A name for this operation (optional).

  Returns:
    the normalized, scaled, offset tensor.
  """
  with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
    inv = math_ops.rsqrt(variance + variance_epsilon)
    if scale is not None:
      inv *= scale
    return x * inv + (offset - mean * inv
                      if offset is not None else -mean * inv)
```
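Note how the body rearranges the textbook formula gamma * (x - mean) / sqrt(variance + eps) + beta into x * inv + (offset - mean * inv) with inv = gamma * rsqrt(variance + eps): everything except x is precomputed once, leaving a single multiply-add per element. A quick pure-Python check with made-up scalar values (not TensorFlow code) confirms the two forms agree:

```python
import math

x, mean, var, gamma, beta, eps = 3.0, 1.0, 4.0, 2.0, 0.5, 0.001

# Textbook form: gamma * (x - mean) / sqrt(var + eps) + beta
textbook = gamma * (x - mean) / math.sqrt(var + eps) + beta

# TensorFlow's rearranged form: precompute inv, then one multiply-add.
inv = gamma / math.sqrt(var + eps)  # stands in for gamma * rsqrt(var + eps)
rearranged = x * inv + (beta - mean * inv)
```

The two expressions are algebraically identical; the rearrangement just minimizes per-element work on large tensors.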
C:\Anaconda3\envs\tensorflow\Lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py
Invocation: tf.contrib.layers.batch_norm
```python
def batch_norm(inputs,
               decay=0.999,
               center=True,
               scale=False,
               epsilon=0.001,
               activation_fn=None,
               param_initializers=None,
               param_regularizers=None,
               updates_collections=ops.GraphKeys.UPDATE_OPS,
               is_training=True,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               batch_weights=None,
               fused=False,
               data_format=DATA_FORMAT_NHWC,
               zero_debias_moving_mean=False,
               scope=None,
               renorm=False,
               renorm_clipping=None,
               renorm_decay=0.99):
  """Adds a Batch Normalization layer from http://arxiv.org/abs/1502.03167.

    "Batch Normalization: Accelerating Deep Network Training by Reducing
    Internal Covariate Shift"
    Sergey Ioffe, Christian Szegedy

  Can be used as a normalizer function for conv2d and fully_connected.

  Note: When is_training is True the moving_mean and moving_variance need to be
  updated, by default the update_ops are placed in `tf.GraphKeys.UPDATE_OPS` so
  they need to be added as a dependency to the `train_op`, example:

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
      train_op = optimizer.minimize(loss)

  One can set updates_collections=None to force the updates in place, but that
  can have speed penalty, especially in distributed settings.

  Args:
    inputs: A tensor with 2 or more dimensions, where the first dimension has
      `batch_size`. The normalization is over all but the last dimension if
      `data_format` is `NHWC` and the second dimension if `data_format` is
      `NCHW`.
    decay: Decay for the moving average. Reasonable values for `decay` are close
      to 1.0, typically in the multiple-nines range: 0.999, 0.99, 0.9, etc.
      Lower `decay` value (recommend trying `decay`=0.9) if model experiences
      reasonably good training performance but poor validation and/or test
      performance. Try zero_debias_moving_mean=True for improved stability.
    center: If True, add offset of `beta` to normalized tensor. If False, `beta`
      is ignored.
    scale: If True, multiply by `gamma`. If False, `gamma` is
      not used. When the next layer is linear (also e.g. `nn.relu`), this can be
      disabled since the scaling can be done by the next layer.
    epsilon: Small float added to variance to avoid dividing by zero.
    activation_fn: Activation function, default set to None to skip it and
      maintain a linear activation.
    param_initializers: Optional initializers for beta, gamma, moving mean and
      moving variance.
    param_regularizers: Optional regularizer for beta and gamma.
    updates_collections: Collections to collect the update ops for computation.
      The updates_ops need to be executed with the train_op.
      If None, a control dependency would be added to make sure the updates are
      computed in place.
    is_training: Whether or not the layer is in training mode. In training mode
      it would accumulate the statistics of the moments into `moving_mean` and
      `moving_variance` using an exponential moving average with the given
      `decay`. When it is not in training mode then it would use the values of
      the `moving_mean` and the `moving_variance`.
    reuse: Whether or not the layer and its variables should be reused. To be
      able to reuse the layer scope must be given.
    variables_collections: Optional collections for the variables.
    outputs_collections: Collections to add the outputs.
    trainable: If `True` also add variables to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
    batch_weights: An optional tensor of shape `[batch_size]`,
      containing a frequency weight for each batch item. If present,
      then the batch normalization uses weighted mean and
      variance. (This can be used to correct for bias in training
      example selection.)
    fused: Use nn.fused_batch_norm if True, nn.batch_normalization otherwise.
    data_format: A string. `NHWC` (default) and `NCHW` are supported.
    zero_debias_moving_mean: Use zero_debias for moving_mean. It creates a new
      pair of variables 'moving_mean/biased' and 'moving_mean/local_step'.
    scope: Optional scope for `variable_scope`.
    renorm: Whether to use Batch Renormalization
      (https://arxiv.org/abs/1702.03275). This adds extra variables during
      training. The inference is the same for either value of this parameter.
    renorm_clipping: A dictionary that may map keys 'rmax', 'rmin', 'dmax' to
      scalar `Tensors` used to clip the renorm correction. The correction
      `(r, d)` is used as `corrected_value = normalized_value * r + d`, with
      `r` clipped to [rmin, rmax], and `d` to [-dmax, dmax]. Missing rmax, rmin,
      dmax are set to inf, 0, inf, respectively.
    renorm_decay: Momentum used to update the moving means and standard
      deviations with renorm. Unlike `momentum`, this affects training
      and should be neither too small (which would add noise) nor too large
      (which would give stale estimates). Note that `decay` is still applied
      to get the means and variances for inference.

  Returns:
    A `Tensor` representing the output of the operation.

  Raises:
    ValueError: If `batch_weights` is not None and `fused` is True.
    ValueError: If `param_regularizers` is not None and `fused` is True.
    ValueError: If `data_format` is neither `NHWC` nor `NCHW`.
    ValueError: If the rank of `inputs` is undefined.
    ValueError: If rank or channels dimension of `inputs` is undefined.
  """
  if fused:
    if batch_weights is not None:
      raise ValueError('Weighted mean and variance is not currently '
                       'supported for fused batch norm.')
    if param_regularizers is not None:
      raise ValueError('Regularizers are not currently '
                       'supported for fused batch norm.')
    if renorm:
      raise ValueError('Renorm is not supported for fused batch norm.')
    return _fused_batch_norm(
        inputs,
        decay=decay,
        center=center,
        scale=scale,
        epsilon=epsilon,
        activation_fn=activation_fn,
        param_initializers=param_initializers,
        updates_collections=updates_collections,
        is_training=is_training,
        reuse=reuse,
        variables_collections=variables_collections,
        outputs_collections=outputs_collections,
        trainable=trainable,
        data_format=data_format,
        zero_debias_moving_mean=zero_debias_moving_mean,
        scope=scope)

  if data_format not in (DATA_FORMAT_NCHW, DATA_FORMAT_NHWC):
    raise ValueError('data_format has to be either NCHW or NHWC.')

  layer_variable_getter = _build_variable_getter()
  with variable_scope.variable_scope(
      scope, 'BatchNorm', [inputs], reuse=reuse,
      custom_getter=layer_variable_getter) as sc:
    inputs = ops.convert_to_tensor(inputs)

    # Determine whether we can use the core layer class.
    if (batch_weights is None and
        updates_collections is ops.GraphKeys.UPDATE_OPS and
        not zero_debias_moving_mean):
      # Use the core layer class.
      axis = 1 if data_format == DATA_FORMAT_NCHW else -1
      if not param_initializers:
        param_initializers = {}
      beta_initializer = param_initializers.get('beta',
                                                init_ops.zeros_initializer())
      gamma_initializer = param_initializers.get('gamma',
                                                 init_ops.ones_initializer())
      moving_mean_initializer = param_initializers.get(
          'moving_mean', init_ops.zeros_initializer())
      moving_variance_initializer = param_initializers.get(
          'moving_variance', init_ops.ones_initializer())
      if not param_regularizers:
        param_regularizers = {}
      beta_regularizer = param_regularizers.get('beta')
      gamma_regularizer = param_regularizers.get('gamma')
      layer = normalization_layers.BatchNormalization(
          axis=axis,
          momentum=decay,
          epsilon=epsilon,
          center=center,
          scale=scale,
          beta_initializer=beta_initializer,
          gamma_initializer=gamma_initializer,
          moving_mean_initializer=moving_mean_initializer,
          moving_variance_initializer=moving_variance_initializer,
          beta_regularizer=beta_regularizer,
          gamma_regularizer=gamma_regularizer,
          trainable=trainable,
          renorm=renorm,
          renorm_clipping=renorm_clipping,
          renorm_momentum=renorm_decay,
          name=sc.name,
          _scope=sc,
          _reuse=reuse)
      outputs = layer.apply(inputs, training=is_training)

      # Add variables to collections.
      _add_variable_to_collections(
          layer.moving_mean, variables_collections, 'moving_mean')
      _add_variable_to_collections(
          layer.moving_variance, variables_collections, 'moving_variance')
      if layer.beta:
        _add_variable_to_collections(layer.beta, variables_collections, 'beta')
      if layer.gamma:
        _add_variable_to_collections(
            layer.gamma, variables_collections, 'gamma')

      if activation_fn is not None:
        outputs = activation_fn(outputs)
      return utils.collect_named_outputs(outputs_collections,
                                         sc.original_name_scope, outputs)

    # Not supported by layer class: batch_weights argument,
    # and custom updates_collections. In that case, use the legacy BN
    # implementation.
    # Custom updates collections are not supported because the update logic
    # is different in this case, in particular w.r.t. "forced updates" and
    # update op reuse.
    if renorm:
      raise ValueError('renorm is not supported with batch_weights, '
                       'updates_collections or zero_debias_moving_mean')
    inputs_shape = inputs.get_shape()
    inputs_rank = inputs_shape.ndims
    if inputs_rank is None:
      raise ValueError('Inputs %s has undefined rank.' % inputs.name)
    dtype = inputs.dtype.base_dtype
    if batch_weights is not None:
      batch_weights = ops.convert_to_tensor(batch_weights)
      inputs_shape[0:1].assert_is_compatible_with(batch_weights.get_shape())
      # Reshape batch weight values so they broadcast across inputs.
      nshape = [-1] + [1 for _ in range(inputs_rank - 1)]
      batch_weights = array_ops.reshape(batch_weights, nshape)

    if data_format == DATA_FORMAT_NCHW:
      moments_axes = [0] + list(range(2, inputs_rank))
      params_shape = inputs_shape[1:2]
      # For NCHW format, rather than relying on implicit broadcasting, we
      # explicitly reshape the params to params_shape_broadcast when computing
      # the moments and the batch normalization.
      params_shape_broadcast = list(
          [1, inputs_shape[1].value] + [1 for _ in range(2, inputs_rank)])
    else:
      moments_axes = list(range(inputs_rank - 1))
      params_shape = inputs_shape[-1:]
      params_shape_broadcast = None
    if not params_shape.is_fully_defined():
      raise ValueError('Inputs %s has undefined channels dimension %s.' % (
          inputs.name, params_shape))

    # Allocate parameters for the beta and gamma of the normalization.
    beta, gamma = None, None
    if not param_initializers:
      param_initializers = {}
    if center:
      beta_collections = utils.get_variable_collections(variables_collections,
                                                        'beta')
      beta_initializer = param_initializers.get('beta',
                                                init_ops.zeros_initializer())
      beta = variables.model_variable('beta',
                                      shape=params_shape,
                                      dtype=dtype,
                                      initializer=beta_initializer,
                                      collections=beta_collections,
                                      trainable=trainable)
    if scale:
      gamma_collections = utils.get_variable_collections(variables_collections,
                                                         'gamma')
      gamma_initializer = param_initializers.get('gamma',
                                                 init_ops.ones_initializer())
      gamma = variables.model_variable('gamma',
                                       shape=params_shape,
                                       dtype=dtype,
                                       initializer=gamma_initializer,
                                       collections=gamma_collections,
                                       trainable=trainable)

    # Create moving_mean and moving_variance variables and add them to the
    # appropriate collections. We disable variable partitioning while creating
    # them, because assign_moving_average is not yet supported for partitioned
    # variables.
    partitioner = variable_scope.get_variable_scope().partitioner
    try:
      variable_scope.get_variable_scope().set_partitioner(None)
      moving_mean_collections = utils.get_variable_collections(
          variables_collections, 'moving_mean')
      moving_mean_initializer = param_initializers.get(
          'moving_mean', init_ops.zeros_initializer())
      moving_mean = variables.model_variable(
          'moving_mean',
          shape=params_shape,
          dtype=dtype,
          initializer=moving_mean_initializer,
          trainable=False,
          collections=moving_mean_collections)
      moving_variance_collections = utils.get_variable_collections(
          variables_collections, 'moving_variance')
      moving_variance_initializer = param_initializers.get(
          'moving_variance', init_ops.ones_initializer())
      moving_variance = variables.model_variable(
          'moving_variance',
          shape=params_shape,
          dtype=dtype,
          initializer=moving_variance_initializer,
          trainable=False,
          collections=moving_variance_collections)
    finally:
      variable_scope.get_variable_scope().set_partitioner(partitioner)

    # If `is_training` doesn't have a constant value, because it is a `Tensor`,
    # a `Variable` or `Placeholder` then is_training_value will be None and
    # `needs_moments` will be true.
    is_training_value = utils.constant_value(is_training)
    need_moments = is_training_value is None or is_training_value
    if need_moments:
      # Calculate the moments based on the individual batch.
      if batch_weights is None:
        if data_format == DATA_FORMAT_NCHW:
          mean, variance = nn.moments(inputs, moments_axes, keep_dims=True)
          mean = array_ops.reshape(mean, [-1])
          variance = array_ops.reshape(variance, [-1])
        else:
          mean, variance = nn.moments(inputs, moments_axes)
      else:
        if data_format == DATA_FORMAT_NCHW:
          mean, variance = nn.weighted_moments(inputs, moments_axes,
                                               batch_weights, keep_dims=True)
          mean = array_ops.reshape(mean, [-1])
          variance = array_ops.reshape(variance, [-1])
        else:
          mean, variance = nn.weighted_moments(inputs, moments_axes,
                                               batch_weights)

      moving_vars_fn = lambda: (moving_mean, moving_variance)
      if updates_collections is None:
        def _force_updates():
          """Internal function forces updates moving_vars if is_training."""
          update_moving_mean = moving_averages.assign_moving_average(
              moving_mean, mean, decay, zero_debias=zero_debias_moving_mean)
          update_moving_variance = moving_averages.assign_moving_average(
              moving_variance, variance, decay, zero_debias=False)
          with ops.control_dependencies([update_moving_mean,
                                         update_moving_variance]):
            return array_ops.identity(mean), array_ops.identity(variance)
        mean, variance = utils.smart_cond(is_training,
                                          _force_updates,
                                          moving_vars_fn)
      else:
        def _delay_updates():
          """Internal function that delay updates moving_vars if is_training."""
          update_moving_mean = moving_averages.assign_moving_average(
              moving_mean, mean, decay, zero_debias=zero_debias_moving_mean)
          update_moving_variance = moving_averages.assign_moving_average(
              moving_variance, variance, decay, zero_debias=False)
          return update_moving_mean, update_moving_variance

        update_mean, update_variance = utils.smart_cond(is_training,
                                                        _delay_updates,
                                                        moving_vars_fn)
        ops.add_to_collections(updates_collections, update_mean)
        ops.add_to_collections(updates_collections, update_variance)
        # Use computed moments during training and moving_vars otherwise.
        vars_fn = lambda: (mean, variance)
        mean, variance = utils.smart_cond(is_training, vars_fn, moving_vars_fn)
    else:
      mean, variance = moving_mean, moving_variance
    if data_format == DATA_FORMAT_NCHW:
      mean = array_ops.reshape(mean, params_shape_broadcast)
      variance = array_ops.reshape(variance, params_shape_broadcast)
      beta = array_ops.reshape(beta, params_shape_broadcast)
      if gamma is not None:
        gamma = array_ops.reshape(gamma, params_shape_broadcast)

    # Compute batch_normalization.
    outputs = nn.batch_normalization(inputs, mean, variance, beta, gamma,
                                     epsilon)
    outputs.set_shape(inputs_shape)
    if activation_fn is not None:
      outputs = activation_fn(outputs)
    return utils.collect_named_outputs(outputs_collections,
                                       sc.original_name_scope, outputs)
```
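The moving_mean and moving_variance used at inference time are maintained by assign_moving_average, which applies an exponential moving average: variable <- variable * decay + value * (1 - decay). A minimal pure-Python sketch of that update rule, with illustrative values (the real TensorFlow op also supports zero-debiasing, which this sketch omits):

```python
def assign_moving_average_sketch(moving, value, decay):
    """One EMA step: moving <- moving * decay + value * (1 - decay)."""
    return moving * decay + value * (1.0 - decay)

# Feed in four batches whose per-batch mean is 10.0; the moving mean drifts
# toward 10.0 at a rate controlled by decay.
moving_mean = 0.0
for batch_mean in [10.0, 10.0, 10.0, 10.0]:
    moving_mean = assign_moving_average_sketch(moving_mean, batch_mean,
                                               decay=0.9)
```

The slow drift from the zero initialization is exactly the bias that the zero_debias_moving_mean option is meant to correct, which is why the docstring suggests it when a high decay causes training/validation mismatch.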