5. Forward/Backward Propagation: Layer Normalization

References

cs231n Course Materials: Backprop
Derivatives, Backpropagation, and Vectorization
cs231n Lecture 4: Neural Networks and Backpropagation
cs231n Assignment 2
Notes: Batch Normalization and Its Backpropagation

5. Layer Normalization

Layer Normalization essentially takes what Batch Normalization does along the first dimension N of the input X and applies it to the second dimension D instead (the learnable parameters $\gamma$ and $\beta$ are still D-dimensional, though). So an implementation only needs to transpose the input, reuse the batch-norm logic, and transpose back in the right places (a sketch of this transpose trick is given below). And since the mean and variance are computed over the second dimension D, they can be computed directly per sample at test time as well, so no running averages are needed.
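To make that concrete, here is a minimal sketch of the transpose trick. It assumes the `batchnorm_forward(x, gamma, beta, bn_param)` helper from cs231n Assignment 2's `layers.py` is available; the function name `layernorm_forward_via_bn` and the cache layout are illustrative, not the assignment's reference solution.

```python
import numpy as np
# from cs231n.layers import batchnorm_forward  # assumed available from Assignment 2

def layernorm_forward_via_bn(X, gamma, beta, ln_param):
    """Illustrative sketch: layer norm via a transposed call to batch norm."""
    N, D = X.shape
    eps = ln_param.get('eps', 1e-5)
    # On X.T (shape (D, N)), batch norm averages over its first axis, which is
    # now the per-sample feature axis D, i.e. exactly the statistics layer norm
    # needs. Pass identity scale/shift so that gamma/beta (shape (D,)) can be
    # applied per feature after transposing back.
    bn_param = {'mode': 'train', 'eps': eps}
    X_hat_T, bn_cache = batchnorm_forward(X.T, np.ones(N), np.zeros(N), bn_param)
    X_hat = X_hat_T.T                 # back to shape (N, D)
    Y = gamma * X_hat + beta          # per-feature scale and shift
    cache = (X_hat, gamma, bn_cache)
    return Y, cache
```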

Still, let's work through the derivation here.

Forward pass

"""Forward pass for layer normalization.

    Input:
    - X: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - Y: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
"""

$$\mu_{i}=\frac{1}{D}\sum_{j=1}^{D}{X_{i,j}}\tag{5.1}$$

$$\sigma_i=\sqrt{\frac{1}{D}\sum_{j=1}^{D}{\left(X_{i,j}-\mu_i\right)^2}+\epsilon}\tag{5.2}$$

$$\hat{X}_{i,j}=\frac{X_{i,j}-\mu_i}{\sigma_{i}}\tag{5.3}$$

$$Y_{i,j}=\gamma_j\hat{X}_{i,j}+\beta_j\tag{5.4}$$
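A minimal NumPy sketch of the forward pass following Eqs. (5.1)-(5.4) and the docstring above; the variable names and the cache layout are just one possible choice.

```python
import numpy as np

def layernorm_forward(X, gamma, beta, ln_param):
    """Minimal sketch of the layer-norm forward pass, Eqs. (5.1)-(5.4)."""
    eps = ln_param.get('eps', 1e-5)
    mu = X.mean(axis=1, keepdims=True)          # (5.1), shape (N, 1)
    var = X.var(axis=1, keepdims=True)          # biased variance, factor 1/D
    sigma = np.sqrt(var + eps)                  # (5.2), shape (N, 1)
    X_hat = (X - mu) / sigma                    # (5.3), shape (N, D)
    Y = gamma * X_hat + beta                    # (5.4), gamma/beta broadcast over rows
    cache = (X_hat, sigma, gamma)               # what the backward pass needs
    return Y, cache
```

With `axis=1` reductions and `keepdims=True` everything stays vectorized over the batch, and no running statistics are kept, matching the note above about test time.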


Backward pass


From Eq. (5.4):
$$\frac{\partial{L}}{\partial{\beta_j}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}\frac{\partial{Y_{i,j}}}{\partial{\beta_j}}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}}\tag{5.5}$$

$$\frac{\partial{L}}{\partial{\gamma_j}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}\frac{\partial{Y_{i,j}}}{\partial{\gamma_j}}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}\hat{X}_{i,j}}\tag{5.6}$$

$$\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}=\frac{\partial{L}}{\partial{Y_{i,j}}}\frac{\partial{Y_{i,j}}}{\partial{\hat{X}_{i,j}}}=\frac{\partial{L}}{\partial{Y_{i,j}}}\gamma_j\tag{5.7}$$
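In code, Eqs. (5.5)-(5.7) translate directly into three NumPy reductions. This fragment continues the forward sketch above: `dout` denotes the upstream gradient $\partial L/\partial Y$ of shape (N, D), and `cache` is the tuple returned by `layernorm_forward`; both names are assumptions of this sketch.

```python
# Gradients w.r.t. beta, gamma and X_hat, following Eqs. (5.5)-(5.7).
X_hat, sigma, gamma = cache
dbeta = dout.sum(axis=0)               # (5.5): sum over the batch dimension N
dgamma = (dout * X_hat).sum(axis=0)    # (5.6)
dX_hat = dout * gamma                  # (5.7): elementwise, gamma broadcast over rows
```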


From Eq. (5.3):
$$\begin{aligned}\frac{\partial{L}}{\partial{\sigma_{i}}}&=\sum_{j=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\sigma_{i}}}}\\&=-\sum_{j=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{X_{i,j}-\mu_i}{\sigma_i^2}}\\&=-\frac{1}{\sigma_i}\sum_{k=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,k}}}\hat{X}_{i,k}}\end{aligned}\tag{5.8}$$
$$\begin{aligned}\frac{\partial{L}}{\partial{\mu_i}}&=\frac{\partial{L}}{\partial{\sigma_i}}\frac{\partial{\sigma_i}}{\partial{\mu_i}}+\sum_{j=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\mu_i}}}\\&=\frac{\partial{L}}{\partial{\sigma_i}}\frac{\frac{1}{D}\sum_{j=1}^{D}{-2\left(X_{i,j}-\mu_i\right)}}{2\sigma_i}+\sum_{j=1}^{D}{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\mu_i}}}\\&=\frac{\partial{L}}{\partial{\sigma_i}}\cdot 0+\sum_{j=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\left(-\frac{1}{\sigma_i}\right)}\\&=-\frac{1}{\sigma_i}\sum_{k=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,k}}}}\end{aligned}\tag{5.9}$$

Here the $\partial{\sigma_i}/\partial{\mu_i}$ term vanishes because $\sum_{j=1}^{D}\left(X_{i,j}-\mu_i\right)=0$.
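Continuing the same sketch, Eqs. (5.8) and (5.9) become per-row reductions over the feature axis:

```python
# Per-row gradients w.r.t. sigma_i and mu_i, following Eqs. (5.8)-(5.9).
# sigma has shape (N, 1); the sums run over the feature dimension D.
dsigma = -(dX_hat * X_hat).sum(axis=1, keepdims=True) / sigma   # (5.8)
dmu = -dX_hat.sum(axis=1, keepdims=True) / sigma                # (5.9)
```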


Finally, combining the three paths through $\hat{X}_{i,j}$, $\sigma_i$ and $\mu_i$:

$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,j}}}&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{X_{i,j}}}+\frac{\partial{L}}{\partial{\sigma_i}}\frac{\partial{\sigma_i}}{\partial{X_{i,j}}}+\frac{\partial{L}}{\partial{\mu_i}}\frac{\partial{\mu_i}}{\partial{X_{i,j}}}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{1}{\sigma_i}+\frac{\partial{L}}{\partial{\sigma_i}}\frac{\frac{2}{D}\left(X_{i,j}-\mu_i\right)}{2\sigma_i}+\frac{\partial{L}}{\partial{\mu_i}}\frac{1}{D}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{1}{\sigma_i}+\frac{\partial{L}}{\partial{\sigma_i}}\frac{1}{D}\hat{X}_{i,j}+\frac{\partial{L}}{\partial{\mu_i}}\frac{1}{D}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{1}{\sigma_i}-\left(\frac{1}{\sigma_i}\sum_{k=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,k}}}\hat{X}_{i,k}}\right)\frac{1}{D}\hat{X}_{i,j}-\left(\frac{1}{\sigma_i}\sum_{k=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,k}}}}\right)\frac{1}{D}\\&=\frac{1}{D\sigma_i}\left(D\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}-\hat{X}_{i,j}\sum_{k=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,k}}}\hat{X}_{i,k}}-\sum_{k=1}^D{\frac{\partial{L}}{\partial{\hat{X}_{i,k}}}}\right)\end{aligned}\tag{5.10}$$
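Eq. (5.10) gives the input gradient. The sketch below uses the unfolded fourth line of (5.10), reusing `dsigma` and `dmu` from above; the folded last line is shown as an equivalent alternative.

```python
# Gradient w.r.t. the input X, following Eq. (5.10):
# combine the three paths through X_hat, sigma and mu.
N, D = dX_hat.shape
dX = dX_hat / sigma + dsigma * X_hat / D + dmu / D
# Equivalent folded form (last line of Eq. (5.10)):
# dX = (D * dX_hat
#       - X_hat * (dX_hat * X_hat).sum(axis=1, keepdims=True)
#       - dX_hat.sum(axis=1, keepdims=True)) / (D * sigma)
```

Either form, together with `dgamma` and `dbeta` above, completes the backward pass.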
