shape函数_深度学习中Batch Normalization和Dice激活函数

最新推荐文章于 2023-03-23 11:46:46 发布

weixin_39559079

最新推荐文章于 2023-03-23 11:46:46 发布

阅读量253

点赞数

文章标签： shape函数

一、前言

最近在使用阿里的DIEN(Deep Interest Evolution Network for Click-Through Rate Prediction)模型优化我们自己的信息流推荐，阿里给了一个实现的github地址https://github.com/mouna99/dien，因为要使用tf serving部署服务，所以对代码做了重构，第一版上线后，没有明显的收益，再次review代码，发现里面有两个坑。其一是在做batch normalization的时候，训练的时候没有更新参数，其二就是Dice激活函数在预测的时候实现的有问题，修复后，auc有明显的提升(大概有1%的提升，在ctr预估场景，这个提升已经非常大了)。后面主要讲一下具体是怎么修复的。

二、Batch Normalization

Batch Normalization主要是用来解决Internal Covariate Shift(在神经网络的训练过程中，由于网络参数变化而引起隐藏层数据分布变化的过程)问题，Internal Covariate Shift主要会有两个方面的影响：

上层网络需要不断的调整参数来适应输入数据分布的变化，导致学习速度的降低
网络的训练过程不易收敛，容易形成梯度消失

Batch Normalization的提出缓解了上面两个问题

在训练阶段，变换公式如下：

其中X表示一个mini-batch的样本，

分别表示均值和方差。

在测试阶段，计算的公式如下：

其中

分别表示训练阶段每组batch的均值和方差.

可以看到，在训练阶段和测试阶段，Batch Normalization的计算是不一样的，我们再看一下tensorflow里面的函数原型：

@tf_export('layers.batch_normalization')
def batch_normalization(inputs,
                        axis=-1,
                        momentum=0.99,
                        epsilon=1e-3,
                        center=True,
                        scale=True,
                        beta_initializer=init_ops.zeros_initializer(),
                        gamma_initializer=init_ops.ones_initializer(),
                        moving_mean_initializer=init_ops.zeros_initializer(),
                        moving_variance_initializer=init_ops.ones_initializer(),
                        beta_regularizer=None,
                        gamma_regularizer=None,
                        beta_constraint=None,
                        gamma_constraint=None,
                        training=False,
                        trainable=True,
                        name=None,
                        reuse=None,
                        renorm=False,
                        renorm_clipping=None,
                        renorm_momentum=0.99,
                        fused=None,
                        virtual_batch_size=None,
                        adjustment=None):
  """Functional interface for the batch normalization layer.

  Reference: http://arxiv.org/abs/1502.03167

  "Batch Normalization: Accelerating Deep Network Training by Reducing
  Internal Covariate Shift"

  Sergey Ioffe, Christian Szegedy

  Note: when training, the moving_mean and moving_variance need to be updated.
  By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they
  need to be added as a dependency to the `train_op`. Also, be sure to add
  any batch_normalization ops before getting the update_ops collection.
  Otherwise, update_ops will be empty, and training/inference will not work
  properly. For example:

可以看到，里面的参数training表示是否为训练阶段，默认为False，也就是说如果我们不显示指定这个参数，Batch Normalization在我们的模型中是没有起任何作用。修改后的使用方式如下：

is_training = mode == model_fn.ModeKeys.TRAIN
hidden_layer = tf.layers.batch_normalization(inputs=hidden_layer, training=is_training)

测试了一下在正确设置training后，学习率在0.7的前提下，auc从0.682提升到0.687。

三、Dice激活函数

Dice激活函数也是用来解决Internal Covariate Shift问题，和Batch Normalization有异曲同工之妙，定义如下：

我们将这个公式变化一下，设置

，结果如下：

可以看到，Dice就是在BN上做了变换，github上有Dice的实现，代码如下：

def dice(_x, axis=-1, epsilon=0.000000001, name=''):
  with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
    alphas = tf.get_variable('alpha'+name, _x.get_shape()[-1],
                         initializer=tf.constant_initializer(0.0),
                         dtype=tf.float32)
    input_shape = list(_x.get_shape())

    reduction_axes = list(range(len(input_shape)))
    del reduction_axes[axis]
    broadcast_shape = [1] * len(input_shape)
    broadcast_shape[axis] = input_shape[axis]

  # case: train mode (uses stats of the current batch)
  mean = tf.reduce_mean(_x, axis=reduction_axes)
  brodcast_mean = tf.reshape(mean, broadcast_shape)
  std = tf.reduce_mean(tf.square(_x - brodcast_mean) + epsilon, axis=reduction_axes)
  std = tf.sqrt(std)
  brodcast_std = tf.reshape(std, broadcast_shape)
  x_normed = (_x - brodcast_mean) / (brodcast_std + epsilon)
  # x_normed = tf.layers.batch_normalization(_x, center=False, scale=False)
  x_p = tf.sigmoid(x_normed)

可以看到，其中BN的只实现了训练阶段，预测阶段没有实现，这样会导致相同的N条数据，按不同的batch-size作为输入，预估得分不一样，我们把它做一下修改：

def Dice(_x, axis=-1, epsilon=0.000000001, name='dice', training=True):
    alphas = tf.get_variable('alpha_'+name, _x.get_shape()[-1],
            initializer=tf.constant_initializer(0.0),
            dtype=tf.float32)
    inputs_normed = tf.layers.batch_normalization(
            inputs=_x, 
            axis=axis, 
            epsilon=epsilon, 
            center=False, 
            scale=False, 
            training=training)
    x_p = tf.sigmoid(inputs_normed)
    return alphas * (1.0 - x_p) * _x + x_p * _x

修改后，相同的N条数据，按不同的batch-size作为输入，预估得分保持一致，模型的auc大概能提升0.4%。

四、总结

深度学习里面有很多解决梯度消失和非线性变化的激活函数，还有一些解决过拟合的方法，tensorflow都做了很好的封装，使用起来很方便，但是要注意里面的一些参数设置，了解清楚实现的原理，要不然很容易采坑，调试起来也相当麻烦。
大多数论文都有开源实现的demo，但是真正要运用到具体的业务，还需要清楚其实现原理，一些坑只有上线了才能发现
额外补充：深度学习里面，很多时候会遇到高维稀疏特征，在广告和推荐场景很常见，在对ID特征做embedding的时候，embedding的大小很关键，之前有同学尝试过调整embedding size，离线Auc和线上指标都有不错的收益。

五、Reference

[Zhou et al. 2018c] Zhou, G.; Zhu, X.; Song, C.; Fan, Y.;Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; and Gai, K. 2018c. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1059–1068. ACM.
github.com/mouna99/dien.
天雨粟：Batch Normalization原理与实战

weixin_39559079

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
shape函数_深度学习中Batch Normalization和Dice激活函数

一、前言最近在使用阿里的DIEN(Deep Interest Evolution Network for Click-Through Rate Prediction)模型优化我们自己的信息流推荐，阿里给了一个实现的github地址https://github.com/mouna99/dien，因为要使用tf serving部署服务，所以对代码做了重构，第一版上线后，没有明显的收益，再次re...
复制链接

扫一扫