On LayerNormalization in TensorFlow

As is well known, recurrent neural networks are not a good fit for Batch Normalization, which is one of the motivations behind Layer Normalization.

About LayerNormalization: https://blog.csdn.net/liuxiao214/article/details/81037416

In Layer Normalization, the inputs to the neurons of the same layer share the same mean and variance, while different input samples have different means and variances. Concretely, suppose the input is a batch of images and some layer has shape (m, h, w, c), where m is the batch size, h the height, w the width, and c the number of channels.

Applying Layer Normalization means applying the formula (x - mean) / std, where x is the input of shape (m, h, w, c), mean has shape (m,), and std has shape (m,). Each sample therefore gets its own mean and standard deviation, and the normalization is done per sample.

What about a recurrent neural network? Suppose the input has shape (m, t, feature), where t is the number of time steps. What is the shape of mean then? And what is the shape of std?

According to the paper, mean has shape (m, t) and std has shape (m, t): every time step of every sample gets its own mean and standard deviation.
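To make the shape bookkeeping concrete, here is a minimal sketch (assuming TF 2.x eager execution; the tensor sizes are made up for illustration) of which axes the statistics are reduced over in each case:

import tensorflow as tf

# image-like input (m, h, w, c): one mean/std per sample
img = tf.random.normal((4, 8, 8, 3))
img_mean = tf.reduce_mean(img, axis=(1, 2, 3))
img_std = tf.math.reduce_std(img, axis=(1, 2, 3))
print(img_mean.shape, img_std.shape)  # (4,) (4,)

# RNN-like input (m, t, feature): one mean/std per sample per time step
seq = tf.random.normal((4, 2, 3))
seq_mean = tf.reduce_mean(seq, axis=-1)
seq_std = tf.math.reduce_std(seq, axis=-1)
print(seq_mean.shape, seq_std.shape)  # (4, 2) (4, 2)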

Experimental verification:

1 TensorFlow 1.x

Looking at the TensorFlow 1.14 documentation.

The function is defined as

tf.contrib.layers.layer_norm(
    inputs,
    center=True,
    scale=True,
    activation_fn=None,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    begin_norm_axis=1,
    begin_params_axis=-1,
    scope=None
)

and explained as

Given a tensor inputs of rank R, moments are calculated and normalization is performed over axes begin_norm_axis ... R - 1. Scaling and centering, if requested, is performed over axes begin_params_axis .. R - 1.

By default, begin_norm_axis = 1 and begin_params_axis = -1, meaning that normalization is performed over all but the first axis (the HWC if inputs is NHWC), while the beta and gamma trainable parameters are calculated for the rightmost axis (the C if inputs is NHWC). Scaling and recentering is performed via broadcast of the beta and gamma parameters with the normalized tensor.

The shapes of beta and gamma are inputs.shape[begin_params_axis:], and this part of the inputs' shape must be fully defined.
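Here is a minimal sketch of what this means for an NHWC input with the defaults (begin_norm_axis=1, begin_params_axis=-1); the concrete sizes and the gamma/beta values below are assumptions for illustration only:

import tensorflow as tf

m, h, w, c = 4, 8, 8, 3
x = tf.random.normal((m, h, w, c))

# moments over axes begin_norm_axis .. R-1, i.e. over H, W and C
mean = tf.reduce_mean(x, axis=(1, 2, 3), keepdims=True)      # (m, 1, 1, 1)
std = tf.math.reduce_std(x, axis=(1, 2, 3), keepdims=True)   # (m, 1, 1, 1)
x_hat = (x - mean) / std

# scaling/centering over axes begin_params_axis .. R-1, i.e. only C:
# beta and gamma have shape inputs.shape[-1:] == (c,) and are broadcast
gamma = tf.ones((c,))
beta = tf.zeros((c,))
y = gamma * x_hat + beta
print(y.shape)  # (4, 8, 8, 3)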

Given this explanation, with the default arguments and an image input of shape (m, h, w, c), the computed mean has shape (m,), because begin_norm_axis defaults to 1. For a recurrent neural network, however, you should set begin_norm_axis = -1, so that the computed mean has shape (m, t).

Experimental verification (TF 1.14):

import tensorflow as tf

x1 = tf.convert_to_tensor(
    [[[18.369314, 2.6570225, 20.402943],
      [10.403599, 2.7813416, 20.794857]],
     [[19.0327, 2.6398268, 6.3894367],
      [3.921237, 10.761424, 2.7887821]],
     [[11.466338, 20.210938, 8.242946],
      [22.77081, 11.555874, 11.183836]],
     [[8.976935, 10.204252, 11.20231],
      [-7.356888, 6.2725096, 1.1952505]]])
mean_x = tf.reduce_mean(x1, axis=-1)
print(mean_x.shape)  # (4, 2)
mean_x = tf.expand_dims(mean_x, -1)

std_x = tf.math.reduce_std(x1, axis=-1)
print(std_x.shape)  # (4, 2)
std_x = tf.expand_dims(std_x, -1)

# manual computation: (x - mean) / std
la_no1 = (x1-mean_x)/std_x

x = tf.placeholder(tf.float32, shape=[4, 2, 3])
la_no = tf.contrib.layers.layer_norm(
      inputs=x, begin_norm_axis=-1, begin_params_axis=-1)
with tf.Session() as sess1:
    sess1.run(tf.global_variables_initializer())
    x1 = sess1.run(x1)

    # manual computation
    print(sess1.run(la_no1))
    '''
    [[[ 0.5749929  -1.4064412   0.83144826]
    [-0.1250188  -1.1574404   1.2824593 ]]

    [[ 1.3801126  -0.9573896  -0.422723  ]
    [-0.5402143   1.4019758  -0.86176145]]

    [[-0.36398557  1.3654773  -1.0014919 ]
    [ 1.4136491  -0.6722269  -0.74142253]]

    [[-1.2645671   0.08396867  1.1806016 ]
    [-1.3146634   1.108713    0.20595042]]]
    '''

    # TensorFlow implementation
    print(sess1.run(la_no, feed_dict={x: x1}))
    '''
    [[[ 0.574993   -1.4064413   0.8314482 ]
    [-0.12501884 -1.1574404   1.2824591 ]]

    [[ 1.3801126  -0.9573896  -0.422723  ]
    [-0.5402143   1.4019756  -0.86176145]]

    [[-0.36398554  1.3654773  -1.0014919 ]
    [ 1.4136491  -0.67222667 -0.7414224 ]]

    [[-1.2645674   0.08396816  1.1806011 ]
    [-1.3146634   1.108713    0.20595042]]]
    '''

The two results agree (up to floating-point rounding). That is, for an RNN-style input x of shape (m, t, feature), applying (x - mean) / std uses a mean of shape (m, t) and a std of shape (m, t).

2 TensorFlow 2.x

After upgrading to TensorFlow 2.x, tf.contrib is gone.

Looking at the 2.1.0 documentation:

tf.keras.layers.LayerNormalization(
    axis=-1, epsilon=0.001, center=True, scale=True, beta_initializer='zeros',
    gamma_initializer='ones', beta_regularizer=None, gamma_regularizer=None,
    beta_constraint=None, gamma_constraint=None, trainable=True, name=None, **kwargs
)

Many of the parameters are different from 1.x.

The official explanation:

Normalize the activations of the previous layer for each given example in a batch independently, rather than across a batch like Batch Normalization. i.e. applies a transformation that maintains the mean activation within each example close to 0 and the activation standard deviation close to 1.

The key parameter is axis:

axis: Integer or List/Tuple. The axis that should be normalized (typically the features axis).

The default axis=-1 gives a false sense of familiarity and is easily confused with BatchNormalization (whose axis also defaults to -1). Assume here that the input has shape (m, t, feature).

With axis=-1, the computed mean has shape (m, t), i.e. a mean is computed for each time step of each sample.
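Before the numerical check below, here is a quick sketch of the difference in axis semantics between the two layers (this only illustrates which axes the statistics are reduced over, not the layers' full behavior):

import tensorflow as tf

x = tf.random.normal((4, 2, 3))            # (m, t, feature)

# LayerNormalization(axis=-1): statistics are computed OVER the named axis,
# so each sample/time step gets its own mean -> shape (m, t)
ln_mean = tf.reduce_mean(x, axis=-1)
print(ln_mean.shape)                       # (4, 2)

# BatchNormalization(axis=-1): the named axis is the one that is KEPT;
# during training the moments are taken over the batch and time axes,
# giving one mean per feature -> shape (feature,)
bn_mean = tf.reduce_mean(x, axis=(0, 1))
print(bn_mean.shape)                       # (3,)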

Experimental verification (TF 2.1.0):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

tf.random.set_seed(1234)

x1 = tf.random.normal((4, 2, 3), mean=10, stddev=10.0)
tf.print(x1)
'''
[[[18.3693142 2.65702248 20.4029427]
  [10.4035988 2.78134155 20.794857]]

 [[19.0327 2.63982677 6.38943672]
  [3.92123699 10.7614241 2.78878212]]

 [[11.4663382 20.2109375 8.24294567]
  [22.7708092 11.5558739 11.183836]]

 [[8.97693539 10.2042522 11.2023096]
  [-7.35688782 6.27250957 1.19525051]]]
'''

# TensorFlow implementation
x2 = tf.keras.layers.LayerNormalization()(x1)
tf.print(x2)
'''
[[[0.574988365 -1.40643 0.8314417]
  [-0.125017628 -1.15742981 1.28244758]]

 [[1.38009858 -0.957379758 -0.422718644]
  [-0.540192485 1.40191901 -0.861726582]]

 [[-0.363978326 1.36545086 -1.00147223]
  [1.41362476 -0.672215044 -0.74140954]]

 [[-1.26380491 0.0839173347 1.17988873]
  [-1.31464267 1.10869551 0.205947176]]]

'''

x_mean = tf.reduce_mean(x1, axis=-1)
x_mean = tf.expand_dims(x_mean, -1)

x_std = tf.math.reduce_std(x1, axis=-1)
x_std = tf.expand_dims(x_std, -1)

# manual implementation
x3 = (x1-x_mean)/x_std
tf.print(x3)
'''
[[[0.574992955 -1.40644133 0.831448257]
  [-0.125018805 -1.15744042 1.28245926]]

 [[1.38011265 -0.957389593 -0.422723]
  [-0.5402143 1.40197563 -0.861761391]]

 [[-0.363985568 1.36547732 -1.0014919]
  [1.41364908 -0.672226906 -0.741422534]]

 [[-1.2645669 0.0839686543 1.18060148]
  [-1.31466341 1.10871303 0.205950424]]]
'''

The results are essentially the same; the tiny differences come from the epsilon term (introduced to avoid division by zero) and similar details.
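If you want the manual computation to match the layer output even more closely, include the layer's epsilon (0.001 by default, per the signature above) inside the square root. A small sketch, continuing from the snippet above and reusing x1 (gamma and beta are still at their initial ones/zeros, so they are omitted):

eps = 0.001
x_mean = tf.reduce_mean(x1, axis=-1, keepdims=True)
x_var = tf.math.reduce_variance(x1, axis=-1, keepdims=True)
# (x - mean) / sqrt(var + eps): the standard layer-norm formula with the
# same epsilon as the Keras default
x4 = (x1 - x_mean) / tf.sqrt(x_var + eps)
tf.print(x4)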

For images, however, you should not use the default axis; set axis=(1, 2, 3) instead, so that the computed mean has shape (m,).

Experimental verification:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

tf.random.set_seed(1234)

x1 = tf.random.normal((4, 2, 2, 3), mean=10, stddev=10.0)

x2 = tf.keras.layers.LayerNormalization(axis=(1, 2, 3))(x1)
# TensorFlow implementation
tf.print(x2)
'''
[[[[1.13592184 -1.01682484 1.41455]
   [0.0445363373 -0.999791801 1.46824622]]

  [[1.22681248 -1.01918077 -0.505445421]
   [-0.84361434 0.0935621038 -0.998772383]]]


 [[[0.241638556 1.41170228 -0.189664766]
   [1.75422382 0.253618807 0.203838587]]

  [[-0.0914538875 0.0727662146 0.206310436]
   [-2.27698731 -0.453317314 -1.13267565]]]


 [[[0.660608232 -1.50688756 -1.25433147]
   [-0.108726673 0.251018792 1.11019969]]

  [[1.55131137 -0.91692245 1.37210977]
   [0.350336075 -0.651512 -0.857205331]]]


 [[[-1.13443768 0.891957879 -1.49989474]
   [0.853844702 2.05934501 -0.13168712]]

  [[-0.669010222 -1.08240855 0.9768731]
   [-0.31119889 -0.043616 0.0902313]]]]
'''

x_mean = tf.reduce_mean(x1, axis=(1, 2, 3))
# print(x_mean.shape)  # (4,)

x_mean = tf.reshape(x_mean, (-1, 1, 1, 1))
# print(x_mean.shape)  # (4, 1, 1, 1)


x_std = tf.math.reduce_std(x1, axis=(1, 2, 3))
x_std = tf.reshape(x_std, (-1, 1, 1, 1))

x3 = (x1-x_mean)/x_std
# manual implementation
tf.print(x3)  # the two approaches agree closely
'''
[[[[1.13593256 -1.01683426 1.4145633]
   [0.0445368551 -0.99980104 1.46826017]]

  [[1.22682405 -1.01919019 -0.50545007]
   [-0.843622148 0.0935630798 -0.998781621]]]


 [[[0.241640702 1.41171491 -0.18966648]
   [1.75423956 0.253621072 0.203840405]]

  [[-0.091454722 0.0727668479 0.206312269]
   [-2.27700782 -0.453321397 -1.1326859]]]


 [[[0.660612881 -1.50689745 -1.25433969]
   [-0.108727202 0.25102067 1.11020732]]

  [[1.55132198 -0.916928411 1.37211931]
   [0.350338638 -0.651516199 -0.857210875]]]


 [[[-1.13444376 0.891962826 -1.49990284]
   [0.853849471 2.05935621 -0.131687731]]

  [[-0.669013798 -1.08241439 0.976878583]
   [-0.31120047 -0.0436161309 0.0902319]]]]
'''

As you can see, the results are basically identical.
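Finally, a hedged usage sketch of dropping the layer into a small Keras model (the LSTM width and the Dense head here are assumptions for illustration, not from anything above): with the default axis=-1 it normalizes the LSTM features for each sample at each time step.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, return_sequences=True, input_shape=(2, 3)),
    tf.keras.layers.LayerNormalization(),   # axis=-1: per sample, per time step
    tf.keras.layers.Dense(1),
])

x = tf.random.normal((4, 2, 3))             # (m, t, feature)
y = model(x)
print(y.shape)                              # (4, 2, 1)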
