@inproceedings{glorot2010understanding,
  title={Understanding the difficulty of training deep feedforward neural networks},
  author={Glorot, Xavier and Bengio, Yoshua},
  booktitle={Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS)},
  pages={249--256},
  year={2010}}
Summary

This paper proposes the Xavier parameter initialization method.

Main content
For layers $i = 1, \ldots, d$:
$$\mathbf{s}^i = \mathbf{z}^i W^i + \mathbf{b}^i, \qquad \mathbf{z}^{i+1} = f(\mathbf{s}^i),$$
where $\mathbf{z}^i$ is the input to layer $i$, $\mathbf{s}^i$ is the pre-activation vector, and $f(\cdot)$ is the activation function, assumed symmetric about 0 with $f'(0) = 1$ (e.g. tanh).
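As a concrete reading of these definitions, here is a minimal NumPy sketch of the forward recurrence (names, shapes, and sizes are my own illustrative choices, not the paper's code):

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    """Apply s^i = z^i W^i + b^i and z^{i+1} = f(s^i) for each layer.

    weights[i] has shape (n_i, n_{i+1}); x holds the first-layer inputs
    z^1 as rows.
    """
    z = x
    for W, b in zip(weights, biases):
        s = z @ W + b   # pre-activation s^i
        z = f(s)        # input to the next layer, z^{i+1}
    return z

# tiny usage example with arbitrary sizes
rng = np.random.default_rng(0)
ws = [rng.normal(0.0, 0.1, size=(8, 8)) for _ in range(3)]
bs = [np.zeros(8) for _ in range(3)]
out = forward(rng.normal(size=(4, 8)), ws, bs)
```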
Then
$$\mathrm{Var}(z^{i+1}) = n_i \,\mathrm{Var}(w^i z^i),$$
which holds approximately near 0 (since $f'(0) = 1$). Here $z^i$ and $w^i$ denote a single element of $\mathbf{z}^i$ and of $W^i$, respectively; the $\{w^i\}$ are assumed i.i.d., $w^i$ and $z^i$ are assumed mutually independent, and additionally $\mathbb{E}(w^i) = 0$ and $\mathbb{E}(x) = 0$, where $x$ is an input sample. Then
$$\mathrm{Var}(z^{i+1}) = n_i \,\mathrm{Var}(w^i)\,\mathrm{Var}(z^i),$$
again holding approximately near 0.
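The step from $\mathrm{Var}(w^i z^i)$ to $\mathrm{Var}(w^i)\,\mathrm{Var}(z^i)$ uses the independence and zero-mean assumptions above (symmetry of $f$ about 0 keeps $\mathbb{E}(z^i) = 0$); a one-line check:
$$\mathrm{Var}(wz) = \mathbb{E}(w^2 z^2) - \big(\mathbb{E}(wz)\big)^2 = \mathbb{E}(w^2)\,\mathbb{E}(z^2) - \big(\mathbb{E}(w)\,\mathbb{E}(z)\big)^2 = \mathrm{Var}(w)\,\mathrm{Var}(z).$$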
Hence
$$\mathrm{Var}(z^i) = \mathrm{Var}(x) \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}(w^{i'}),$$
where $n_i$ denotes the number of input units of layer $i$.
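A quick Monte Carlo sketch of this product formula (layer widths, weight scale, and input scale are illustrative choices of mine; small values keep tanh near its linear regime, where the approximation is valid):

```python
import numpy as np

rng = np.random.default_rng(0)
n = [64, 64, 64, 64]      # n_i: input width of each layer (my choice)
sigma_w = 0.05            # small weights keep tanh near its linear regime
x = rng.normal(0.0, 0.1, size=(10000, n[0]))   # E(x) = 0, as assumed above

z = x
predicted = np.var(x)
for n_in, n_out in zip(n[:-1], n[1:]):
    W = rng.normal(0.0, sigma_w, size=(n_in, n_out))
    z = np.tanh(z @ W)                  # biases omitted for simplicity
    predicted *= n_in * sigma_w**2      # multiply in n_{i'} Var(w^{i'})
    print(f"empirical Var(z) = {np.var(z):.6f}, predicted = {predicted:.6f}")
```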
From the backpropagation of the gradients we have:
$$\frac{\partial \mathrm{Cost}}{\partial s_k^i} = f'(s_k^i)\, W_{k,\cdot}^{i+1} \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}}, \tag{2}$$
$$\frac{\partial \mathrm{Cost}}{\partial w_{l,k}^i} = z_l^i \frac{\partial \mathrm{Cost}}{\partial s_k^i}. \tag{3}$$
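Under the same independence assumptions as in the forward case, and with $f'(s_k^i) \approx 1$ near 0, taking variances in (2) gives a one-step recurrence (a sketch, following the indexing of (2)):
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s_k^i}\right] = \mathrm{Var}\!\left[\sum_{k'=1}^{n_{i+1}} w_{k,k'}^{i+1}\, \frac{\partial \mathrm{Cost}}{\partial s_{k'}^{i+1}}\right] = n_{i+1}\, \mathrm{Var}[w^{i+1}]\, \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{i+1}}\right];$$
iterating this step from layer $i$ up to the output layer $d$ gives (6).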
Thus
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s_k^i}\right] = \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1}\, \mathrm{Var}[w^{i'}], \tag{6}$$
and
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[w^{i'}] \prod_{i'=i}^{d} n_{i'+1}\, \mathrm{Var}[w^{i'}] \times \mathrm{Var}(x)\, \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right].$$
If we require the variance of the activations $z^i$ to remain constant across layers in the forward pass, then
$$\forall i, \quad n_i\, \mathrm{Var}[w^i] = 1. \tag{10}$$
If we require the variance of the gradients $\frac{\partial \mathrm{Cost}}{\partial s^i}$ to remain constant across layers in the backward pass, then
$$\forall i, \quad n_{i+1}\, \mathrm{Var}[w^i] = 1. \tag{11}$$
The two conditions cannot hold simultaneously unless consecutive layers have the same width ($n_i = n_{i+1}$), so the paper adopts a compromise between them:
$$\mathrm{Var}[w^i] = \frac{2}{n_{i+1} + n_i},$$
and constructs a uniform distribution with exactly this variance from which $w^i$ is sampled:
$$w^i \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{i+1}+n_i}},\ \frac{\sqrt{6}}{\sqrt{n_{i+1}+n_i}}\right].$$
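A minimal sketch of this initializer (the function name and sizes are my own): a uniform $U[-a, a]$ has variance $a^2/3$, so $a = \sqrt{6/(n_i + n_{i+1})}$ gives exactly the compromise variance $2/(n_i + n_{i+1})$.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample an (n_in, n_out) weight matrix from U[-a, a] with
    a = sqrt(6 / (n_in + n_out)), so Var(w) = a^2/3 = 2/(n_in + n_out)."""
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(256, 128)
print(np.var(W), 2.0 / (256 + 128))   # empirical vs. target variance
```

This is the same rule that deep-learning libraries expose as Glorot/Xavier uniform initialization (e.g. `torch.nn.init.xavier_uniform_`).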
The paper also contains extensive analysis of different activation functions, such as sigmoid, tanh, and softsign… Those are not the focus here, so they are not recorded.