Batch Normalization
Batch normalization makes the search for hyperparameters easier: once it is applied, the network becomes less sensitive to the choice of hyperparameters and therefore more robust.
Batch normalization normalizes the values passed into each layer's activation function. In effect, each hidden layer can be treated as a small network of its own whose inputs are normalized, which reduces the coupling between layers.
Much as with normalizing the input data, batch normalization does essentially the same job, except that the values being normalized are the pre-activations $z^{[l](i)}$ fed into the activation function inside the network. Simply put, we compute
$$\mu^{[l]}=\frac{1}{m}\sum_{i=1}^m z^{[l](i)},\qquad \sigma^{2[l]}=\frac{1}{m}\sum_{i=1}^m\left(z^{[l](i)}-\mu^{[l]}\right)^2$$
and then transform each $z^{[l](i)}$ into data with zero mean and unit variance:
$$z^{[l](i)}_\text{norm}=\frac{z^{[l](i)}-\mu^{[l]}}{\sqrt{\sigma^{2[l]}+\varepsilon}}$$
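As a minimal NumPy sketch of these two steps (variable names are illustrative, and $\varepsilon$ is the usual small constant guarding against division by zero):

```python
import numpy as np

def normalize_batch(z, eps=1e-8):
    """Normalize a batch of pre-activations z (shape: units x examples),
    following the mean/variance formulas above."""
    mu = np.mean(z, axis=1, keepdims=True)                # per-unit mean over the batch
    var = np.mean((z - mu) ** 2, axis=1, keepdims=True)   # per-unit variance
    z_norm = (z - mu) / np.sqrt(var + eps)
    return z_norm, mu, var

z = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
z_norm, mu, var = normalize_batch(z)
# After normalization each row has (near-)zero mean and unit variance.
```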
Sometimes, however, we would rather choose the mean and variance of $z^{[l](i)}$ ourselves. This is controlled by two parameters $\gamma$ and $\beta$ (yes, another $\beta$), i.e. we set
$$\tilde{z}^{[l](i)}=\gamma^{[l]} z^{[l](i)}_\text{norm}+\beta^{[l]}$$
to move $z^{[l](i)}$ to whatever distribution is needed.
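A quick sketch of why this works: once the data has zero mean and unit variance, scaling by $\gamma$ and shifting by $\beta$ sets the standard deviation and mean directly (the concrete values below are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z_norm = rng.standard_normal((1, 1000))
z_norm = (z_norm - z_norm.mean()) / z_norm.std()  # exactly zero mean, unit std

gamma, beta = 2.0, 5.0
z_tilde = gamma * z_norm + beta
# z_tilde now has mean beta and standard deviation |gamma|.
```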
Here $\gamma^{[l]}$ and $\beta^{[l]}$ need not be set by hand; they are ordinary parameters learned directly during training. Moreover, because normalization shifts the mean of every $z^{[l](i)}$ to 0, the bias parameter $b^{[l]}$ becomes unnecessary, and we keep only $W^{[l]}$, $\gamma^{[l]}$ and $\beta^{[l]}$.
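The claim that $b^{[l]}$ is redundant can be checked numerically: any constant added per unit is removed by the mean subtraction, so normalizing $Z$ and $Z+b$ gives the same result (array shapes below are illustrative):

```python
import numpy as np

def norm(Z, eps=1e-8):
    """Batch-normalize rows of Z, as in the formulas above."""
    mu = Z.mean(axis=1, keepdims=True)
    var = ((Z - mu) ** 2).mean(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
Z = rng.standard_normal((2, 6))
b = np.array([[3.0], [-7.0]])  # a would-be bias, one entry per unit

same = np.allclose(norm(Z), norm(Z + b))  # the bias cancels out
```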
So in forward propagation the computation becomes
$$\begin{aligned}
&Z^{[l]}=W^{[l]}A^{[l-1]}\\
&\mu^{[l]}=\frac{1}{m}\,\text{np.sum}(Z^{[l]},\text{axis}=1,\text{keepdims}=\text{True})\\
&\sigma^{2[l]}=\frac{1}{m}\,\text{np.sum}((Z^{[l]}-\mu^{[l]})^2,\text{axis}=1,\text{keepdims}=\text{True})\\
&Z^{[l]}_\text{norm}=\frac{Z^{[l]}-\mu^{[l]}}{\sqrt{\sigma^{2[l]}+\varepsilon}}\\
&\tilde{Z}^{[l]}=\gamma^{[l]}*Z^{[l]}_\text{norm}+\beta^{[l]}\\
&A^{[l]}=g^{[l]}(\tilde{Z}^{[l]})
\end{aligned}$$
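These steps can be sketched as one NumPy function. ReLU is assumed here as the activation $g^{[l]}$, and all names are illustrative:

```python
import numpy as np

def bn_forward(A_prev, W, gamma, beta, eps=1e-8):
    """One forward step through a batch-normalized layer,
    following the equations above."""
    Z = W @ A_prev                                        # no bias b: BN's beta replaces it
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.mean((Z - mu) ** 2, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    A = np.maximum(0.0, Z_tilde)                          # ReLU as g^{[l]} (an assumption)
    return A, Z_norm

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
A_prev = rng.standard_normal((4, 5))
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
A, Z_norm = bn_forward(A_prev, W, gamma, beta)
```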
and the backward propagation computation is
$$\begin{aligned}
&d\tilde{Z}^{[l]}=dA^{[l]}*g^{[l]\prime}(\tilde{Z}^{[l]})\\
&dZ^{[l]}_\text{norm}=d\tilde{Z}^{[l]}*\gamma^{[l]}\\
&d\beta^{[l]}=\frac{1}{m}\,\text{np.sum}(d\tilde{Z}^{[l]},\text{axis}=1,\text{keepdims}=\text{True})\\
&d\gamma^{[l]}=\frac{1}{m}\,\text{np.sum}(d\tilde{Z}^{[l]}*Z^{[l]}_\text{norm},\text{axis}=1,\text{keepdims}=\text{True})\\
&d\sigma^{2[l]}=\frac{1}{m}\,\text{np.sum}\Bigl(dZ^{[l]}_\text{norm}*(Z^{[l]}-\mu^{[l]})\Bigl(-\tfrac{1}{2}(\sigma^{2[l]}+\varepsilon)^{-\frac{3}{2}}\Bigr),\text{axis}=1,\text{keepdims}=\text{True}\Bigr)\\
&d\mu^{[l]}=\frac{1}{m}\,\text{np.sum}\Bigl(dZ^{[l]}_\text{norm}*\frac{-1}{\sqrt{\sigma^{2[l]}+\varepsilon}},\text{axis}=1,\text{keepdims}=\text{True}\Bigr)+d\sigma^{2[l]}\,\frac{1}{m}\,\text{np.sum}(-2(Z^{[l]}-\mu^{[l]}),\text{axis}=1,\text{keepdims}=\text{True})\\
&dZ^{[l]}=\frac{1}{\sqrt{\sigma^{2[l]}+\varepsilon}}*dZ^{[l]}_\text{norm}+\frac{2(Z^{[l]}-\mu^{[l]})}{m}*d\sigma^{2[l]}+\frac{1}{m}\,d\mu^{[l]}\\
&dW^{[l]}=\frac{1}{m}\,dZ^{[l]}A^{[l-1]T}
\end{aligned}$$
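A direct transcription of these equations into NumPy is sketched below. It follows the $\frac{1}{m}$ averaging convention used in the notes, takes $d\tilde{Z}^{[l]}$ (the gradient already pushed through the activation) as input, and all names and shapes are illustrative:

```python
import numpy as np

def bn_backward(dZ_tilde, Z, Z_norm, mu, var, gamma, A_prev, eps=1e-8):
    """Backward pass through one batch-normalized layer,
    transcribing the equations above."""
    m = Z.shape[1]
    dZ_norm = dZ_tilde * gamma
    dbeta = np.sum(dZ_tilde, axis=1, keepdims=True) / m
    dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True) / m
    dvar = np.sum(dZ_norm * (Z - mu) * (-0.5 * (var + eps) ** -1.5),
                  axis=1, keepdims=True) / m
    dmu = (np.sum(dZ_norm * (-1.0 / np.sqrt(var + eps)),
                  axis=1, keepdims=True) / m
           + dvar * np.sum(-2.0 * (Z - mu), axis=1, keepdims=True) / m)
    dZ = (dZ_norm / np.sqrt(var + eps)
          + dvar * 2.0 * (Z - mu) / m
          + dmu / m)
    dW = dZ @ A_prev.T / m
    return dW, dgamma, dbeta

# Tiny shape check with random data.
rng = np.random.default_rng(0)
n_l, n_prev, m = 3, 5, 8
Z = rng.standard_normal((n_l, m))
mu = Z.mean(axis=1, keepdims=True)
var = Z.var(axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(var + 1e-8)
gamma = np.ones((n_l, 1))
A_prev = rng.standard_normal((n_prev, m))
dZ_tilde = rng.standard_normal((n_l, m))
dW, dgamma, dbeta = bn_backward(dZ_tilde, Z, Z_norm, mu, var, gamma, A_prev)
```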
At test time we may have only a single example, so $\mu$ and $\sigma^2$ cannot be computed directly. When applying BN on the test set, we therefore use exponentially weighted averages of the $\mu$ and $\sigma^2$ values observed on the training set as estimates with which to normalize the test data. Most deep learning frameworks also provide similar machinery for estimating the mean and variance; in fact, as long as the estimate is reasonable, BN remains quite robust at test time.
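A sketch of that bookkeeping: during training, keep running exponentially weighted averages of the batch statistics, then reuse them to normalize a single test example (the momentum value 0.9 and the scalar statistics are illustrative choices, not prescribed by the text):

```python
import numpy as np

momentum = 0.9          # decay rate of the exponentially weighted average
running_mu, running_var = 0.0, 1.0

# Pretend these (mu, sigma^2) pairs were measured on successive mini-batches.
for batch_mu, batch_var in [(0.5, 2.0), (0.6, 1.8), (0.4, 2.2)]:
    running_mu = momentum * running_mu + (1 - momentum) * batch_mu
    running_var = momentum * running_var + (1 - momentum) * batch_var

# At test time a single example x is normalized with the running estimates.
x = 1.0
x_norm = (x - running_mu) / np.sqrt(running_var + 1e-8)
```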