Derivation of the Batch Normalization Backpropagation Formulas

Article link: https://blog.csdn.net/lanchunhui/article/details/70187880

What does the gradient flowing through batch normalization look like?

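Every weight-update formula used in gradient descent with backpropagation is derived from the chain rule for gradients of composite functions.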

1. Batch Normalization

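Consider input samples $X\in \mathbb R^{N\times D}$ passed through a hidden layer with $H$ neurons. The weight matrix connecting the input layer to the hidden layer is $W\in \mathbb R^{D\times H}$, and the bias vector is $b\in \mathbb R^H$.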

Batch Normalization proceeds as follows:

Affine transformation:

$h=XW+b$

Clearly $h\in \mathbb R^{N\times H}$.

Batch normalization transform:

$y=\gamma \hat h+\beta$

where $\gamma, \beta$ are learnable parameters and $\hat h$ is $h$ with its mean removed and its variance normalized:

$\hat h=(h-\mu)(\sigma^2+\epsilon)^{-1/2}$

In scalar form:

$\widehat h_{kl}=(h_{kl}-\mu_l)(\sigma_l^2+\epsilon)^{-1/2}$

Here $l\in\{1, \ldots, H\}$, and $\mu$ and $\sigma^2$ are the mean vector and variance vector obtained by taking the mean and variance of each column (feature) of the matrix $h\in \mathbb R^{N\times H}$:

$\mu_l=\frac1N\sum_ph_{pl}, \quad \sigma_l^2=\frac{1}{N}\sum_p\left(h_{pl}-\mu_l\right)^2$
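The forward pass defined above can be sketched in NumPy as follows. This is only an illustrative sketch: the function name `batchnorm_forward`, the default `eps=1e-5`, and the random test shapes are assumptions made for the example and do not come from the original article. It uses the biased ($1/N$) variance, matching the definition of $\sigma_l^2$.

```python
import numpy as np

def batchnorm_forward(X, W, b, gamma, beta, eps=1e-5):
    """Affine transform followed by batch normalization (batch statistics).

    Hypothetical helper for illustration; eps=1e-5 is an assumed default.
    """
    h = X @ W + b                           # affine transform, shape (N, H)
    mu = h.mean(axis=0)                     # per-column mean mu_l, shape (H,)
    var = h.var(axis=0)                     # per-column biased variance (1/N), shape (H,)
    h_hat = (h - mu) / np.sqrt(var + eps)   # normalized activations, shape (N, H)
    y = gamma * h_hat + beta                # scale and shift with learnable parameters
    return y, h_hat, mu, var

# Small random example (shapes chosen arbitrarily for the sketch)
rng = np.random.default_rng(0)
N, D, H = 8, 5, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, H))
b = rng.normal(size=H)
gamma = np.ones(H)
beta = np.zeros(H)

y, h_hat, mu, var = batchnorm_forward(X, W, b, gamma, beta)
print(y.shape)               # one row per sample, one column per hidden unit
print(h_hat.mean(axis=0))    # each column of h_hat has (numerically) zero mean
```

Because $\mu$ and $\sigma^2$ are computed per column, every entry $h_{pl}$ of a column influences every $\hat h_{kl}$ in that same column, which is why the backward pass needs the double sum over $k, l$.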

2. Computing $\frac{\partial \mathcal L}{\partial h}, \frac{\partial \mathcal L}{\partial \gamma},\frac{\partial \mathcal L}{\partial \beta}$

We first compute the partial derivatives of the loss function $\mathcal L$ with respect to the hidden-layer input $h$:

$\frac{d\mathcal{L}}{dh} = \begin{pmatrix} \frac{d\mathcal{L}}{dh_{11}} & \cdots & \frac{d\mathcal{L}}{dh_{1H}} \\ \vdots & \frac{d\mathcal{L}}{dh_{kl}} & \vdots \\ \frac{d\mathcal{L}}{dh_{N1}} & \cdots & \frac{d\mathcal{L}}{dh_{NH}} \end{pmatrix}.$

Since the computation proceeds as

$h=XW+b,\quad h \Rightarrow \hat h,\quad \hat h \Rightarrow y$

the chain rule gives:

$\frac{\partial \mathcal L}{\partial h_{ij}}=\sum_{k,l}\frac{\partial \mathcal L}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial \hat h_{kl}}\frac{\partial \hat h_{kl}}{\partial h_{ij}}$

Clearly $\frac{\partial y_{kl}}{\partial \hat h_{kl}}=\gamma_l$.

Moreover, since

$\widehat h_{kl}=(h_{kl}-\mu_l)(\sigma_l^2+\epsilon)^{-1/2},\quad \mu_l=\frac1N\sum_p h_{pl},\quad \sigma_l^2=\frac1N\sum_p (h_{pl}-\mu_l)^2$

it follows that:

$\frac{d\hat{h}_{kl}}{dh_{ij}} = \left(\delta_{ik}\delta_{jl}-\frac{1}{N}\delta_{jl}\right)(\sigma_l^2+\epsilon)^{-1/2}-\frac{1}{2}(h_{kl}-\mu_l)\frac{d\sigma_l^2}{dh_{ij}}(\sigma_l^2+\epsilon)^{-3/2}$

From the definition of $\sigma_l^2$ in terms of the entries of $h$, we obtain:

$\begin{split} \frac{d\sigma_l^2}{dh_{ij}}=&\frac2N\sum_p(h_{pl}-\mu_l)\left(\delta_{ip}\delta_{jl}-\frac1N\delta_{jl}\right)\quad\text{(the } \delta_{ip} \text{ term is nonzero only when } p=i\text{)}\\ =&\frac2N(h_{il}-\mu_l)\delta_{jl}-\frac2N\delta_{jl}\left(\frac1N\sum_p(h_{pl}-\mu_l)\right)\\ =&\frac2N(h_{il}-\mu_l)\delta_{jl}-\frac2N\delta_{jl}\left(\frac1N\sum_ph_{pl}-\mu_l\right)\quad\text{(the second term is zero by the definition of } \mu_l\text{)}\\ =&\frac2N(h_{il}-\mu_l)\delta_{jl} \end{split}$
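As a sanity check, the closed forms for $\frac{d\sigma_l^2}{dh_{ij}}$ and $\frac{d\hat h_{kl}}{dh_{ij}}$ can be compared against central finite differences. The sketch below is illustrative only: the helper names (`dvar_dh`, `dhhat_dh`, `finite_diff`), the step size, the value of `eps`, and the random test matrix are all assumptions made for the example, not part of the original article.

```python
import numpy as np

def variance(h):
    # sigma_l^2 = (1/N) * sum_p (h_pl - mu_l)^2, one value per column l
    return h.var(axis=0)

def normalize(h, eps):
    # h_hat_kl = (h_kl - mu_l) * (sigma_l^2 + eps)^(-1/2)
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def dvar_dh(h, i, j):
    """Closed form d sigma_l^2 / d h_ij = (2/N)(h_il - mu_l) delta_jl, for all l."""
    N, H = h.shape
    out = np.zeros(H)
    out[j] = 2.0 / N * (h[i, j] - h[:, j].mean())   # only l == j is nonzero
    return out

def dhhat_dh(h, i, j, eps):
    """Closed form d h_hat_kl / d h_ij for all (k, l), returned with shape (N, H)."""
    N, H = h.shape
    mu = h.mean(axis=0)
    var = h.var(axis=0)
    delta_ik = np.zeros((N, 1)); delta_ik[i, 0] = 1.0   # Kronecker delta over rows k
    delta_jl = np.zeros((1, H)); delta_jl[0, j] = 1.0   # Kronecker delta over columns l
    first = (delta_ik * delta_jl - delta_jl / N) * (var + eps) ** -0.5
    second = -0.5 * (h - mu) * dvar_dh(h, i, j) * (var + eps) ** -1.5
    return first + second

def finite_diff(f, h, i, j, step=1e-6):
    """Central finite difference of f with respect to the single entry h[i, j]."""
    hp = h.copy(); hp[i, j] += step
    hm = h.copy(); hm[i, j] -= step
    return (f(hp) - f(hm)) / (2 * step)

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))   # arbitrary test matrix, N=6, H=4
eps = 1e-5                    # assumed value for the sketch
i, j = 2, 1                   # entry h_ij being perturbed

var_err = np.max(np.abs(dvar_dh(h, i, j) - finite_diff(variance, h, i, j)))
hhat_err = np.max(np.abs(dhhat_dh(h, i, j, eps)
                         - finite_diff(lambda a: normalize(a, eps), h, i, j)))
print("max error, d sigma^2 / d h_ij:", var_err)
print("max error, d h_hat / d h_ij:  ", hhat_err)
```

Both errors should be near the level of floating-point round-off. Note that the Jacobian is nonzero only in column $l = j$, reflecting the factor $\delta_{jl}$ in every term.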