Deriving the BN Layer Backpropagation Formulas

In the end, the total gradient of the loss with respect to a variable is the sum of all the individual gradients flowing into it.
Forward pass:
$$\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2}}$$
$$y_i = \gamma \hat{x}_i + \beta$$
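To make the notation concrete, here is a minimal NumPy sketch of this forward pass (the function and variable names are my own; since the formulas above omit $\epsilon$, it defaults to 0 here):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=0.0):
    """Batch norm forward for a batch x of shape (m, d), per feature.

    eps defaults to 0 to mirror the formulas above; real implementations
    add a small eps (e.g. 1e-5) under the square root.
    """
    mu = x.mean(axis=0)                    # mu = (1/m) * sum_i x_i
    var = ((x - mu) ** 2).mean(axis=0)     # sigma^2 = (1/m) * sum_i (x_i - mu)^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # x_hat_i = (x_i - mu) / sqrt(sigma^2)
    y = gamma * x_hat + beta               # y_i = gamma * x_hat_i + beta
    return y, (x, x_hat, mu, var)
```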
Basic derivatives:
$$\frac{\partial \sigma^{2}}{\partial \mu} = \frac{1}{m}\sum_{i=1}^{m}\left[-2(x_i-\mu)\right]$$
$$\frac{\partial \hat{x}_i}{\partial \mu} = \frac{-1}{\sqrt{\sigma^{2}}} + \frac{\partial \hat{x}_i}{\partial \sigma^{2}}\frac{\partial \sigma^{2}}{\partial \mu}$$
$$\frac{\partial \hat{x}_i}{\partial \sigma^{2}} = (x_i-\mu)\left(-\frac{1}{2}\right)(\sigma^{2})^{-\frac{3}{2}}$$
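These partials are easy to verify symbolically; here is a quick SymPy check of the first and third (a sketch of mine, with arbitrary symbol names and a concrete batch size m = 3 for the sum):

```python
import sympy as sp

# Check d(x_hat)/d(sigma^2) = (x_i - mu) * (-1/2) * (sigma^2)^(-3/2)
x_i, mu = sp.symbols('x_i mu')
var = sp.Symbol('sigma2', positive=True)
x_hat = (x_i - mu) / sp.sqrt(var)
expected = (x_i - mu) * sp.Rational(-1, 2) * var ** sp.Rational(-3, 2)
assert sp.simplify(sp.diff(x_hat, var) - expected) == 0

# Check d(sigma^2)/d(mu) = (1/m) * sum_i [-2 (x_i - mu)] for m = 3
xs = sp.symbols('x1:4')
var_of_mu = sp.Rational(1, 3) * sum((xj - mu) ** 2 for xj in xs)
expected_dmu = sp.Rational(1, 3) * sum(-2 * (xj - mu) for xj in xs)
assert sp.simplify(sp.diff(var_of_mu, mu) - expected_dmu) == 0
```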
Derivation:
$$①\ \frac{\partial l}{\partial \sigma^{2}} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\,\frac{\partial \hat{x}_i}{\partial \sigma^{2}} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\left[(x_i-\mu)\left(-\frac{1}{2}\right)(\sigma^2)^{-\frac{3}{2}}\right]$$
$$②\ \frac{\partial l}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\,\frac{\partial \hat{x}_i}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\left[\frac{-1}{\sqrt{\sigma^2}}+\frac{\partial \hat{x}_i}{\partial \sigma^2}\frac{\partial \sigma^2}{\partial \mu}\right]$$
$$= \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\left[\frac{-1}{\sqrt{\sigma^2}}+(x_i-\mu)\left(-\frac{1}{2}\right)(\sigma^{2})^{-\frac{3}{2}}\cdot\frac{\partial \sigma^2}{\partial \mu}\right]$$
$$= \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\,\frac{-1}{\sqrt{\sigma^2}} + \left[\sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}(x_i-\mu)\left(-\frac{1}{2}\right)(\sigma^{2})^{-\frac{3}{2}}\right]\cdot\frac{\partial \sigma^2}{\partial \mu}$$
$$= \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\,\frac{-1}{\sqrt{\sigma^2}} + \frac{\partial l}{\partial \sigma^{2}}\cdot\frac{\partial \sigma^{2}}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i}\,\frac{-1}{\sqrt{\sigma^2}} + \frac{\partial l}{\partial \sigma^{2}}\cdot\frac{1}{m}\sum_{i=1}^{m}\left[-2(x_i-\mu)\right]$$
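In NumPy, ① and ② come down to a couple of reductions over the batch (again a sketch of mine, with arbitrary names; $\epsilon$ is omitted to match the formulas):

```python
import numpy as np

def bn_mu_var_grads(dl_dxhat, x, mu, var):
    """Gradients (1) dl/dsigma^2 and (2) dl/dmu; x, dl_dxhat have shape (m, d)."""
    # (1): dl/dvar = sum_i dl/dxhat_i * (x_i - mu) * (-1/2) * var^(-3/2)
    dl_dvar = np.sum(dl_dxhat * (x - mu) * -0.5 * var ** -1.5, axis=0)
    # (2): dl/dmu = sum_i dl/dxhat_i * (-1/sqrt(var))
    #              + dl/dvar * (1/m) * sum_i [-2 * (x_i - mu)]
    dl_dmu = np.sum(-dl_dxhat / np.sqrt(var), axis=0) \
             + dl_dvar * np.mean(-2.0 * (x - mu), axis=0)
    return dl_dvar, dl_dmu
```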
Only after typing all of that out did I remember that multivariable calculus covers exactly this: taking partial derivatives through intermediate variables... I looked it up on Baidu and pasted the reference below.
[Figure: the multivariable chain rule (total derivative) for a function of several intermediate variables]
So, applying this total-derivative rule to
$$\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2}}$$
(and noting that $\mu=\frac{1}{m}\sum_j x_j$ gives $\frac{\partial \mu}{\partial x_i}=\frac{1}{m}$, while $\sigma^2=\frac{1}{m}\sum_j (x_j-\mu)^2$ gives $\frac{\partial \sigma^2}{\partial x_i}=\frac{2(x_i-\mu)}{m}$):
$$③\ \frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i}\frac{\partial \hat{x}_i}{\partial x_i}+\frac{\partial l}{\partial \mu}\frac{\partial \mu}{\partial x_i}+\frac{\partial l}{\partial \sigma^2}\frac{\partial \sigma^2}{\partial x_i}$$
$$= \frac{\partial l}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma^2}} + \frac{\partial l}{\partial \mu}\,\frac{1}{m} + \frac{\partial l}{\partial \sigma^2}\,\frac{2(x_i-\mu)}{m}$$
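Formula ③ then just combines ① and ② with the per-sample term. The sketch below builds on `bn_mu_var_grads` from above and sanity-checks ③ against a finite-difference gradient of a toy loss (all of this is my own illustration, not library code):

```python
def bn_backward_dx(dl_dxhat, x):
    """Formula (3): dl/dx_i = dl/dxhat_i/sqrt(var) + dl/dmu/m + dl/dvar*2(x_i-mu)/m."""
    m = x.shape[0]
    mu, var = x.mean(axis=0), x.var(axis=0)
    dl_dvar, dl_dmu = bn_mu_var_grads(dl_dxhat, x, mu, var)
    return dl_dxhat / np.sqrt(var) + dl_dmu / m + dl_dvar * 2.0 * (x - mu) / m

# Check against a numerical gradient of the toy loss l = sum(w * x_hat)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w = rng.normal(size=(4, 3))          # so dl/dxhat = w

def loss(x):
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0))
    return (w * x_hat).sum()

h = 1e-5
numeric = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    numeric[idx] = (loss(xp) - loss(xm)) / (2 * h)

assert np.allclose(bn_backward_dx(w, x), numeric, atol=1e-6)
```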

Backpropagation through a Batch Normalization (BN) layer computes gradients with respect to both the input and the parameters. The basic steps are:

1. For the BN layer's input $x$ and its parameters $\gamma$ (scale) and $\beta$ (shift), compute the batch mean $\mu$ and variance $\sigma^2$ used in the forward pass.
2. Compute the normalized input $\hat{x}$:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\epsilon$ is a small constant that guards against division by zero.
3. Compute the BN layer's output $y$:
$$y = \gamma \hat{x} + \beta$$
4. Obtain the gradient of the loss $L$ with respect to $y$, $\frac{\partial L}{\partial y}$.
5. Compute the gradients with respect to $\gamma$ and $\beta$:
$$\frac{\partial L}{\partial \gamma} = \sum \frac{\partial L}{\partial y} \cdot \hat{x}, \qquad \frac{\partial L}{\partial \beta} = \sum \frac{\partial L}{\partial y}$$
where the sums run over all samples in the batch.
6. Compute the gradient with respect to $\hat{x}$:
$$\frac{\partial L}{\partial \hat{x}} = \frac{\partial L}{\partial y} \cdot \gamma$$
7. Compute the gradient with respect to $x$. Combining the per-sample term with the contributions through $\mu$ and $\sigma^2$ (exactly the ①②③ derivation above) yields the compact form
$$\frac{\partial L}{\partial x_i} = \frac{1}{N\sqrt{\sigma^2+\epsilon}}\left[N\frac{\partial L}{\partial \hat{x}_i} - \sum_{j=1}^{N}\frac{\partial L}{\partial \hat{x}_j} - \hat{x}_i\sum_{j=1}^{N}\frac{\partial L}{\partial \hat{x}_j}\,\hat{x}_j\right]$$
where $N$ is the batch size.

These steps all follow from the chain rule over the computation graph. In practice, frameworks ship fused BN backward kernels, which are both faster and numerically more careful than a hand-rolled version.
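As a rough illustration of how steps 5 through 7 translate into code, here is my own NumPy sketch of the full backward pass (not any framework's actual kernel; the eps here must match the one used in the forward normalization):

```python
import numpy as np

def bn_backward(dl_dy, x_hat, var, gamma, eps=1e-5):
    """BN backward: dl_dy, x_hat of shape (N, d); var, gamma of shape (d,)."""
    N = x_hat.shape[0]
    # Step 5: parameter gradients, summed over the batch
    dl_dgamma = np.sum(dl_dy * x_hat, axis=0)
    dl_dbeta = np.sum(dl_dy, axis=0)
    # Step 6: gradient w.r.t. the normalized input
    dl_dxhat = dl_dy * gamma
    # Step 7: compact closed form of dl/dx
    inv_std = 1.0 / np.sqrt(var + eps)
    dl_dx = (inv_std / N) * (N * dl_dxhat
                             - np.sum(dl_dxhat, axis=0)
                             - x_hat * np.sum(dl_dxhat * x_hat, axis=0))
    return dl_dx, dl_dgamma, dl_dbeta
```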
