References
cs231n Course Materials: Backprop
Derivatives, Backpropagation, and Vectorization
cs231n Lecture 4: Neural Networks and Backpropagation
cs231n Assignment 2
Notes: Batch Normalization and its Backward Pass
4. Batch Normalization
Forward Pass
"""
Forward pass for batch normalization.
Input:
- X: Data of shape (N, D)
- gamma: Scale parameter of shape (D,)
- beta: Shift paremeter of shape (D,)
- bn_param: Dictionary with the following keys:
- mode: 'train' or 'test'; required
- eps: Constant for numeric stability
- momentum: Constant for running mean / variance.
- running_mean: Array of shape (D,) giving running mean of features
- running_var Array of shape (D,) giving running variance of features
Returns a tuple of:
- Y: of shape (N, D)
- cache: A tuple of values needed in the backward pass
"""
$$\mu_{j}=\frac{1}{N}\sum_{i=1}^{N}{X_{i,j}}\tag{4.1}$$
$$\sigma_j^2=\frac{1}{N}\sum_{i=1}^{N}{\left(X_{i,j}-\mu_j\right)^2}\tag{4.2}$$
$$\hat{X}_{i,j}=\frac{X_{i,j}-\mu_j}{\sqrt{\sigma^2_{j}+\epsilon}}\tag{4.3}$$
In Eq. (4.3), $\epsilon$ is a very small positive constant that prevents the denominator from becoming zero. On top of this, learnable parameters $\gamma$ and $\beta$ are introduced:
$$Y_{i,j}=\gamma_j\hat{X}_{i,j}+\beta_j\tag{4.4}$$
Note that the mean and variance computed by Eqs. (4.1) and (4.2) are only used during training; at test time, the mean and variance are running (exponential moving) averages of the training-time statistics.
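The forward pass in Eqs. (4.1)-(4.4), including the train/test distinction, can be sketched in NumPy as follows. This is a minimal sketch following the interface in the docstring above, not the official assignment solution; the cache layout `(X, X_hat, mu, var, gamma, eps)` is an assumption of this sketch.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, bn_param):
    """Sketch of the batch-norm forward pass, Eqs. (4.1)-(4.4)."""
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = X.shape
    running_mean = bn_param.get('running_mean', np.zeros(D))
    running_var = bn_param.get('running_var', np.zeros(D))

    if mode == 'train':
        mu = X.mean(axis=0)                       # Eq. (4.1)
        var = X.var(axis=0)                       # Eq. (4.2)
        X_hat = (X - mu) / np.sqrt(var + eps)     # Eq. (4.3)
        Y = gamma * X_hat + beta                  # Eq. (4.4)
        # Running averages consumed at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
        cache = (X, X_hat, mu, var, gamma, eps)
    elif mode == 'test':
        X_hat = (X - running_mean) / np.sqrt(running_var + eps)
        Y = gamma * X_hat + beta
        cache = None
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var
    return Y, cache
```

With `gamma = 1` and `beta = 0`, each output column has (nearly) zero mean and unit variance, which is a quick sanity check on the normalization step.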
Backward Pass
Splitting the computation above into finer steps yields the following computational graph:
We now derive the backward pass from back to front.
$$Y_{i,j}=\hat{X}^\gamma_{i,j}+\beta_j\tag{4.5}$$
From Eq. (4.5):
$$\begin{aligned}\frac{\partial{L}}{\partial{\beta_j}}&=\sum_{i}{\frac{\partial{L}}{\partial{Y_{i,j}}}\frac{\partial{Y_{i,j}}}{\partial{\beta_j}}}\\&=\sum_{i}{\frac{\partial{L}}{\partial{Y_{i,j}}}\cdot1}\\&=\sum_{i}{\frac{\partial{L}}{\partial{Y_{i,j}}}}\end{aligned}\tag{4.6}$$
$$\begin{aligned}\frac{\partial{L}}{\partial{\hat{X}^{\gamma}_{i,j}}}&=\frac{\partial{L}}{\partial{Y_{i,j}}}\frac{\partial{Y_{i,j}}}{\partial{\hat{X}^{\gamma}_{i,j}}}\\&=\frac{\partial{L}}{\partial{Y_{i,j}}}\cdot1\\&=\frac{\partial{L}}{\partial{Y_{i,j}}}\end{aligned}\tag{4.7}$$
$$\hat{X}^\gamma_{i,j}=\gamma_j\hat{X}_{i,j}\tag{4.8}$$
From Eq. (4.8):
$$\begin{aligned}\frac{\partial{L}}{\partial{\gamma_j}}&=\sum_{i}{\frac{\partial{L}}{\partial{\hat{X}_{i,j}^{\gamma}}}\frac{\partial{\hat{X}_{i,j}^{\gamma}}}{\partial{\gamma_j}}}\\&=\sum_{i}{\frac{\partial{L}}{\partial{\hat{X}_{i,j}^{\gamma}}}\hat{X}_{i,j}}\end{aligned}\tag{4.9}$$
$$\begin{aligned}\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}^{\gamma}}}\frac{\partial{\hat{X}_{i,j}^{\gamma}}}{\partial{\hat{X}_{i,j}}}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}^{\gamma}}}\gamma_j\end{aligned}\tag{4.10}$$
$$\hat{X}_{i,j}=(1/\hat{\sigma}_j)X^m_{i,j}\tag{4.11}$$
From Eq. (4.11):
$$\begin{aligned}\frac{\partial{L}}{\partial{(1/\hat{\sigma}_j)}}&=\sum_{i}{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{(1/\hat{\sigma}_j)}}}\\&=\sum_{i}{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}X^m_{i,j}}\end{aligned}\tag{4.12}$$
$$1/\hat{\sigma}_j=\frac{1}{\hat{\sigma}_{j}}\tag{4.13}$$
From Eq. (4.13):
$$\begin{aligned}\frac{\partial{L}}{\partial{\hat{\sigma}_j}}&=\frac{\partial{L}}{\partial{(1/\hat{\sigma}_j)}}\frac{\partial{(1/\hat{\sigma}_j)}}{\partial{\hat{\sigma}_j}}\\&=-\frac{\partial{L}}{\partial{(1/\hat{\sigma}_j)}}\frac{1}{\hat{\sigma}_j^2}\\&=-\frac{\partial{L}}{\partial{(1/\hat{\sigma}_j)}}\frac{1}{\sigma_j^2+\epsilon}\end{aligned}\tag{4.14}$$
$$\hat{\sigma}_j=\sqrt{\sigma_j^2+\epsilon}\tag{4.15}$$
From Eq. (4.15):
$$\begin{aligned}\frac{\partial{L}}{\partial{\sigma_j^2}}&=\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{\partial{\hat{\sigma}_j}}{\partial{\sigma_j^2}}\\&=\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{1}{2\sqrt{\sigma^2_j+\epsilon}}\end{aligned}\tag{4.16}$$
$$\sigma^2_j=\frac{1}{N}\sum_i{X^2_{i,j}}\tag{4.17}$$
From Eq. (4.17):
$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,j}^2}}&=\frac{\partial{L}}{\partial{\sigma_j^2}}\frac{\partial{\sigma_j^2}}{\partial{X_{i,j}^2}}\\&=\frac{\partial{L}}{\partial{\sigma_j^2}}\frac{1}{N}\end{aligned}\tag{4.18}$$
$$X^2_{i,j}=\left(X^m_{i,j}\right)^2\tag{4.19}$$
From Eqs. (4.11) and (4.19):
$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,j}^m}}&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{X^m_{i,j}}}+\frac{\partial{L}}{\partial{X^2_{i,j}}}\frac{\partial{X^2_{i,j}}}{\partial{X^m_{i,j}}}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}(1/\hat{\sigma}_j)+\frac{\partial{L}}{\partial{X^2_{i,j}}}\cdot2X^m_{i,j}\end{aligned}\tag{4.20}$$
$$X^m_{i,j}=X_{i,j}-\mu_j\tag{4.21}$$
From Eq. (4.21):
$$\begin{aligned}\frac{\partial{L}}{\partial{\mu_j}}&=\sum_i{\frac{\partial{L}}{\partial{X^m_{i,j}}}\frac{\partial{X^m_{i,j}}}{\partial{\mu_j}}}\\&=\sum_i{\frac{\partial{L}}{\partial{X^m_{i,j}}}\cdot(-1)}\\&=-\sum_i{\frac{\partial{L}}{\partial{X^m_{i,j}}}}\end{aligned}\tag{4.22}$$
$$\mu_j=\frac{1}{N}\sum_i{X_{i,j}}\tag{4.23}$$
From Eqs. (4.21) and (4.23):
$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,j}}}&=\frac{\partial{L}}{\partial{X^m_{i,j}}}\frac{\partial{X^m_{i,j}}}{\partial{X_{i,j}}}+\frac{\partial{L}}{\partial{\mu_j}}\frac{\partial{\mu_j}}{\partial{X_{i,j}}}\\&=\frac{\partial{L}}{\partial{X^m_{i,j}}}\cdot1+\frac{\partial{L}}{\partial{\mu_j}}\frac{1}{N}\end{aligned}\tag{4.24}$$
This completes the backward-pass computation of Batch Normalization.
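The staged derivation in Eqs. (4.6)-(4.24) translates almost line-by-line into NumPy. The following is a sketch, not the official assignment solution; it assumes the forward pass cached `(X, X_hat, mu, var, gamma, eps)` as in the forward sketch above.

```python
import numpy as np

def batchnorm_backward_naive(dY, cache):
    """Staged backward pass following Eqs. (4.6)-(4.24)."""
    X, X_hat, mu, var, gamma, eps = cache
    N, D = X.shape
    sigma_hat = np.sqrt(var + eps)                # Eq. (4.15)
    Xm = X - mu                                   # X^m in Eq. (4.21)

    dbeta = dY.sum(axis=0)                        # Eq. (4.6)
    dX_hat_gamma = dY                             # Eq. (4.7)
    dgamma = (dX_hat_gamma * X_hat).sum(axis=0)   # Eq. (4.9)
    dX_hat = dX_hat_gamma * gamma                 # Eq. (4.10)
    dinv_sigma = (dX_hat * Xm).sum(axis=0)        # Eq. (4.12)
    dsigma_hat = -dinv_sigma / (var + eps)        # Eq. (4.14)
    dvar = dsigma_hat / (2 * sigma_hat)           # Eq. (4.16)
    dXm_sq = dvar / N * np.ones((N, D))           # Eq. (4.18)
    dXm = dX_hat / sigma_hat + dXm_sq * 2 * Xm    # Eq. (4.20)
    dmu = -dXm.sum(axis=0)                        # Eq. (4.22)
    dX = dXm + dmu / N                            # Eq. (4.24)
    return dX, dgamma, dbeta
```

A numerical gradient check (perturbing each entry of `X` and re-running the forward pass) is the usual way to confirm that each staged gradient is wired correctly.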
Backward Pass (Simplified Version)
In fact, the computational graph above can be simplified by merging several nodes, which reduces the number of intermediate variables. Strictly speaking, the result may no longer qualify as a computational graph; it amounts to differentiating the original formulas directly.
From Eqs. (4.1)-(4.4):
$$\left\{\begin{aligned}\mu_j&=\frac{1}{N}\sum_i{X_{i,j}}\\\hat{\sigma}_{j}&=\sqrt{\frac{1}{N}\sum_i{\left(X_{i,j}-\mu_j\right)^2}+\epsilon}\\\hat{X}_{i,j}&=\frac{X_{i,j}-\mu_j}{\hat{\sigma}_j}\\Y_{i,j}&=\gamma_j\hat{X}_{i,j}+\beta_j\end{aligned}\right.\tag{4.25}$$
$$\frac{\partial{L}}{\partial{\beta_j}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}}\frac{\partial{Y_{i,j}}}{\partial{\beta_j}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}}\tag{4.26}$$
$$\frac{\partial{L}}{\partial{\gamma_j}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}}\frac{\partial{Y_{i,j}}}{\partial{\gamma_j}}=\sum_i{\frac{\partial{L}}{\partial{Y_{i,j}}}\hat{X}_{i,j}}\tag{4.27}$$
$$\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}=\frac{\partial{L}}{\partial{Y_{i,j}}}\frac{\partial{Y_{i,j}}}{\partial{\hat{X}_{i,j}}}=\frac{\partial{L}}{\partial{Y_{i,j}}}\gamma_j\tag{4.28}$$
$$\begin{aligned}\frac{\partial{L}}{\partial{\hat{\sigma}_{j}}}&=\sum_{i=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\hat{\sigma}_{j}}}}\\&=-\sum_{i=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{X_{i,j}-\mu_j}{\hat{\sigma}_j^2}}\\&=-\frac{1}{\hat{\sigma}_j}\sum_{k=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{k,j}}}\hat{X}_{k,j}}\end{aligned}\tag{4.29}$$
$$\begin{aligned}\frac{\partial{L}}{\partial{\mu_j}}&=\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{\partial{\hat{\sigma}_j}}{\partial{\mu_j}}+\sum_{i=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\mu_j}}}\\&=\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{\frac{1}{N}\sum_{i=1}^{N}{-2\left(X_{i,j}-\mu_j\right)}}{2\hat{\sigma}_j}+\sum_{i=1}^{N}{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\mu_j}}}\\&=\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{\frac{1}{N}\sum_{i=1}^N{\left(\mu_j-X_{i,j}\right)}}{\hat{\sigma}_j}+\sum_{i=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\mu_j}}}\\&=\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\cdot\frac{0}{\hat{\sigma}_j}+\sum_{i=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{\mu_j}}}\\&=\sum_{i=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\left(-\frac{1}{\hat{\sigma}_j}\right)}\\&=-\frac{1}{\hat{\sigma}_j}\sum_{k=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{k,j}}}}\end{aligned}\tag{4.30}$$
$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,j}}}&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{\partial{\hat{X}_{i,j}}}{\partial{X_{i,j}}}+\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{\partial{\hat{\sigma}_j}}{\partial{X_{i,j}}}+\frac{\partial{L}}{\partial{\mu_j}}\frac{\partial{\mu_j}}{\partial{X_{i,j}}}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{1}{\hat{\sigma}_j}+\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{\frac{2}{N}\left(X_{i,j}-\mu_j\right)}{2\hat{\sigma}_j}+\frac{\partial{L}}{\partial{\mu_j}}\frac{1}{N}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{1}{\hat{\sigma}_j}+\frac{\partial{L}}{\partial{\hat{\sigma}_j}}\frac{1}{N}\hat{X}_{i,j}+\frac{\partial{L}}{\partial{\mu_j}}\frac{1}{N}\\&=\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}\frac{1}{\hat{\sigma}_j}-\left(\frac{1}{\hat{\sigma}_j}\sum_{k=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{k,j}}}\hat{X}_{k,j}}\right)\frac{1}{N}\hat{X}_{i,j}-\frac{1}{N}\frac{1}{\hat{\sigma}_j}\sum_{k=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{k,j}}}}\\&=\frac{1}{N\hat{\sigma}_j}\left(N\frac{\partial{L}}{\partial{\hat{X}_{i,j}}}-\hat{X}_{i,j}\sum_{k=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{k,j}}}\hat{X}_{k,j}}-\sum_{k=1}^N{\frac{\partial{L}}{\partial{\hat{X}_{k,j}}}}\right)\\&=\frac{\gamma_j}{N\hat{\sigma}_j}\left(N\frac{\partial{L}}{\partial{Y_{i,j}}}-\hat{X}_{i,j}\sum_{k=1}^N{\frac{\partial{L}}{\partial{Y_{k,j}}}\hat{X}_{k,j}}-\sum_{k=1}^N{\frac{\partial{L}}{\partial{Y_{k,j}}}}\right)\end{aligned}\tag{4.31}$$
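Equation (4.31), together with (4.26) and (4.27), collapses the whole backward pass into a few vectorized lines. A sketch under the same assumed cache layout `(X, X_hat, mu, var, gamma, eps)` as the earlier sketches:

```python
import numpy as np

def batchnorm_backward_alt(dY, cache):
    """Vectorized backward pass implementing Eqs. (4.26), (4.27), (4.31)."""
    X, X_hat, mu, var, gamma, eps = cache
    N = dY.shape[0]
    sigma_hat = np.sqrt(var + eps)

    dbeta = dY.sum(axis=0)                     # Eq. (4.26)
    dgamma = (dY * X_hat).sum(axis=0)          # Eq. (4.27)
    # Eq. (4.31): the three chain-rule paths combined into one expression
    dX = (gamma / (N * sigma_hat)) * (
        N * dY
        - X_hat * (dY * X_hat).sum(axis=0)
        - dY.sum(axis=0)
    )
    return dX, dgamma, dbeta
```

Because every per-column sum becomes a single `sum(axis=0)`, this version needs none of the intermediate variables of the staged graph, which is the point of the simplification.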