1. Introduction
During neural network training, the parameters are learned by gradient-descent optimization. Gradient descent requires assigning an initial value to every parameter before training starts, and the choice of these initial values is critical. The following parameter-initialization strategies are in common use in deep learning:
- Constant Initialization
- Random Initialization
- Xavier Initialization
- Kaiming Initialization
This post first gives a brief overview of Constant Initialization and Random Initialization, and then covers Xavier Initialization and Kaiming Initialization in detail, including their derivations.
Before diving into parameter initialization, recall two properties of the variance of random variables (both are used in the derivations below):

- If random variables $X$ and $Y$ are independent, then
$$Var(X + Y) = Var(X) + Var(Y)$$
- If $X$ and $Y$ are independent and $E(X) = E(Y) = 0$, then
$$Var(XY) = Var(X)\,Var(Y)$$
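Both identities are easy to verify numerically. A quick Monte Carlo sanity check (an illustrative sketch, not part of the derivation; the distributions and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1_000_000)   # independent, E(X) = 0, Var(X) = 4
y = rng.normal(0.0, 3.0, size=1_000_000)   # independent, E(Y) = 0, Var(Y) = 9

print(np.var(x + y))   # ≈ Var(X) + Var(Y) = 13
print(np.var(x * y))   # ≈ Var(X) * Var(Y) = 36
```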
In general, the hidden states of a neural network are computed as
$$\mathbf{z}^{i} = \mathbf{W}^{i}\mathbf{h}^{i - 1}, \quad i = 1, 2, \dots$$
and the activations as
$$\mathbf{h}^{i} = f(\mathbf{z}^{i})$$
where $\mathbf{h}^0 = \mathbf{x}$, $\mathbf{x}$ is the network input, $f(\cdot)$ is the activation function, $\mathbf{W}^i \in \mathbb{R}^{n_{i} \times n_{i-1}}$, $\mathbf{z}^i \in \mathbb{R}^{n_i}$, and $\mathbf{h}^{i - 1} \in \mathbb{R}^{n_{i-1}}$ ($n_i$ denotes the number of neurons in layer $i$). $\mathbf{z}$ is called the state (pre-activation) value and $\mathbf{h}$ the activation value.
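The forward recurrence can be sketched in a few lines of numpy (the layer widths and the choice of $f$ below are arbitrary illustrations):

```python
import numpy as np

def forward(x, weights, f=np.tanh):
    """Compute z^i = W^i h^(i-1), h^i = f(z^i), starting from h^0 = x."""
    h = x                      # h^0 is the network input
    activations = [h]
    for W in weights:
        z = W @ h              # state value z^i
        h = f(z)               # activation value h^i
        activations.append(h)
    return activations

# Example with hypothetical widths n_0 = 4, n_1 = 5, n_2 = 3.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
out = forward(rng.normal(size=4), Ws)
print([a.shape for a in out])   # [(4,), (5,), (3,)]
```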
The gradient with respect to the state values is
$$\frac{\partial L}{\partial z^i_k} = f'(z^i_k)\,\big(\mathbf{W}^{i + 1}_{\cdot, k}\big)^{T} \frac{\partial L}{\partial \mathbf{z}^{i + 1}}, \quad k = 1, 2, \dots, n_{i}$$
where $L$ is the loss function.
2. Constant Initialization
Constant Initialization sets every parameter in the network to the same constant, which puts all computing units in exactly the same state. For any given example, every unit then produces the same output and receives the same (or symmetrically related) gradient during backpropagation, so gradient descent can never differentiate the units and the network's expressive power is severely reduced.
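This symmetry problem is easy to demonstrate. In the sketch below (a hypothetical two-layer net with tanh activations and a toy loss), every hidden unit computes the same value and every row of the weight gradient is identical, so updates can never break the tie:

```python
import numpy as np

# Two-layer network with every weight set to the same constant (0.5 here).
n_in, n_hid, n_out = 3, 4, 2
W1 = np.full((n_hid, n_in), 0.5)
W2 = np.full((n_out, n_hid), 0.5)

x = np.array([0.2, -0.1, 0.7])
h = np.tanh(W1 @ x)                # every hidden unit computes the same value
print(np.allclose(h, h[0]))        # True

# For the toy loss L = sum(W2 @ h), the gradient w.r.t. W1 has identical
# rows, so gradient descent updates every hidden unit in lockstep forever.
g_h = W2.T @ np.ones(n_out)        # dL/dh
g_z = g_h * (1.0 - h**2)           # dL/dz via tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(g_z, x)         # dL/dW1
print(np.allclose(grad_W1, grad_W1[0]))   # True
```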
3. Random Initialization
Random Initialization puts each computing unit in a different state. In general we want both the data and the parameters to have zero mean, and the variance of a layer's output to match that of its input. In practice, drawing the parameters from a Gaussian or a uniform distribution both work reasonably well. The difficulty is choosing the hyperparameters of the distribution: $\mu$ and $\sigma$ for $Normal(\mu, \sigma^2)$, or $a$ and $b$ for $Uniform(a, b)$.
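The sensitivity to this choice is easy to demonstrate. The sketch below (width, depth, and the candidate sigmas are arbitrary illustrations) pushes a signal through 50 tanh layers: with too small a sigma the activations collapse toward zero, with too large a sigma they saturate, while $\sigma = 1/\sqrt{n}$ keeps them in a moderate range, foreshadowing the variance-based rules derived next.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50                 # layer width and depth (illustrative)
x = rng.normal(size=n)

results = {}
for sigma in (0.01, 1.0 / np.sqrt(n), 0.2):
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(n, n))
        h = np.tanh(W @ h)         # one hidden layer
    results[sigma] = h.std()
    print(f"sigma={sigma:.4f}  std of final activations: {h.std():.3e}")
```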
4. Xavier Initialization
Xavier Initialization was proposed by Xavier Glorot et al. A good initialization should satisfy the following two conditions:

- in the forward pass, the variance of the activations stays constant across layers;
- in the backward pass, the variance of the gradients with respect to the state values stays constant across layers.

$$\begin{aligned} \forall (i, j),\ Var(h^i) &= Var(h^j) \\ \forall (i, j),\ Var\Big(\frac{\partial L}{\partial z^i}\Big) &= Var\Big(\frac{\partial L}{\partial z^j}\Big) \end{aligned}$$

These two conditions are known as the Glorot conditions.
In addition, for the forward pass we assume:

- the weight matrix $\mathbf{W}$ and the input $\mathbf{x}$ are mutually independent;
- the entries of $\mathbf{W}$ are i.i.d. with $E(W) = 0$;
- the entries of $\mathbf{x}$ are i.i.d. with $E(x) = 0$.

For the backward pass we assume:

- the weight matrix $\mathbf{W}$ and the gradient $\frac{\partial L}{\partial z^i}$ are mutually independent;
- the entries of $\mathbf{W}$ are i.i.d. with $E(W) = 0$;
- the entries of $\frac{\partial L}{\partial z^i}$ are i.i.d. with $E\big(\frac{\partial L}{\partial z^i}\big) = 0$.

For the activation function we assume:

- it is symmetric about the origin, so every layer's input has zero mean;
- its derivative at zero is 1, i.e. $f'(0) = 1$;
- at initialization the state values fall in the linear region of the activation, i.e. $f'(z^i_k) \approx 1$.
1. Forward Pass
From the hidden-state formula and the two variance properties above (the second equality uses the linear-region assumption $f(z) \approx z$ at initialization):

$$\begin{aligned} Var(h^i) &= Var(f(z^i)) \\ &= Var(z^i) \\ &= Var\Big(\sum_{k = 1}^{n_{i-1}} W_{\cdot k}^{i} h_k^{i-1}\Big) \\ &= \sum_{k = 1}^{n_{i-1}} Var\big(W_{\cdot k}^{i} h_k^{i-1}\big) \\ &= n_{i - 1}\, Var(W^{i})\, Var(h^{i-1}) \\ &= n_{i - 1} n_{i - 2}\, Var(W^i)\, Var(W^{i - 1})\, Var(h^{i - 2}) \\ &= \dots \\ &= Var(x) \prod_{k = 1}^{i} n_{k-1}\, Var(W^k) \end{aligned}$$
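With an identity activation the closing product formula is exact and can be checked empirically. In the sketch below (hypothetical widths, weight variance, and batch size), the empirical variance tracks $Var(x)\prod_k n_{k-1} Var(W^k)$ layer by layer:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [200, 160, 240, 180]       # n_0 .. n_3 (hypothetical widths)
sigma2 = 0.005                     # Var(W^k), identical for every layer
batch = 20_000

x = rng.normal(0.0, 1.5, size=(sizes[0], batch))   # Var(x) = 2.25
h, predicted = x, x.var()
for k in range(1, len(sizes)):
    W = rng.normal(0.0, np.sqrt(sigma2), size=(sizes[k], sizes[k - 1]))
    h = W @ h                      # identity activation: Var(h^i) = Var(z^i)
    predicted *= sizes[k - 1] * sigma2   # multiply by n_{k-1} * Var(W^k)
    print(f"layer {k}: empirical Var(h) = {h.var():.4f}, "
          f"predicted = {predicted:.4f}")
```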
Combining the first Glorot condition with the derivation above gives:
$$\forall i,\quad n_{i - 1}\, Var(W^{i}) = 1$$
2. Backward Pass
From the state-gradient formula (dropping $f'(z^i_k)$ by the assumption $f'(z^i_k) \approx 1$) and the two variance properties, with $d$ denoting the index of the last layer:

$$\begin{aligned} Var\Big(\frac{\partial L}{\partial z^i}\Big) &= Var\Big(\big(\mathbf{W}^{i + 1}_{\cdot, k}\big)^T \frac{\partial L}{\partial \mathbf{z}^{i + 1}}\Big) \\ &= Var\Big(\sum_{j = 1}^{n_{i + 1}} W^{i+1}_{j k} \frac{\partial L}{\partial z^{i + 1}_j}\Big) \\ &= n_{i+1}\, Var\Big(W^{i+1} \frac{\partial L}{\partial z^{i+1}}\Big) \\ &= n_{i+1}\, Var(W^{i+1})\, Var\Big(\frac{\partial L}{\partial z^{i+1}}\Big) \\ &= n_{i+1} n_{i+2}\, Var(W^{i+1})\, Var(W^{i+2})\, Var\Big(\frac{\partial L}{\partial z^{i+2}}\Big) \\ &= \dots \\ &= Var\Big(\frac{\partial L}{\partial z^{d}}\Big) \prod_{k = i+1}^{d} n_{k}\, Var(W^k) \end{aligned}$$
Combining the second Glorot condition with the derivation above gives:
$$\forall i,\quad n_{i}\, Var(W^{i}) = 1$$
The two conditions cannot both hold exactly unless $n_{i-1} = n_i$, so Glorot takes the harmonic mean of the two candidate variances:
$$\forall i,\quad Var(W^i) = \frac{2}{n_{i-1} + n_{i}}$$
Here $n_{i-1}$ and $n_i$ can be viewed as the input dimension (fan-in) and output dimension (fan-out) of layer $i$. This yields the following initialization schemes:

- if $W$ follows a normal distribution, then $W \sim Normal\big(0, \frac{2}{n_{i-1} + n_{i}}\big)$;
- if $W$ follows a uniform distribution, then $W \sim Uniform\big(-\sqrt{\frac{6}{n_{i-1} + n_{i}}}, \sqrt{\frac{6}{n_{i-1} + n_{i}}}\big)$ (a uniform distribution on $[-a, a]$ has variance $a^2/3$).
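Both variants are a few lines of numpy. A minimal sketch (deep learning frameworks ship equivalent Glorot initializers; the dimensions below are arbitrary):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """W ~ Normal(0, 2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out, rng):
    """W ~ Uniform(-a, a) with a = sqrt(6 / (n_in + n_out)), so Var = a^2/3."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(300, 200, rng)
print(W.var())    # ≈ 2 / (300 + 200) = 0.004
```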
5. Kaiming Initialization
Xavier Initialization assumes the activation function is (approximately) linear, which does not hold for ReLU; Kaiming Initialization was proposed precisely to address this. It is due to Kaiming He, an author of ResNet, and is mainly used in networks with ReLU activations.
1. Forward Pass
Because the activation is ReLU, $E(h^{i-1}) = 0$ no longer holds, so the derivation differs slightly from Xavier's, although the overall structure is the same:

$$\begin{aligned} Var(z^i) &= n_{i-1}\, Var(W^i h^{i - 1}) \\ &= n_{i-1} \big[E\big((W^i h^{i - 1})^2\big) - \big(E(W^i h^{i - 1})\big)^2\big] \\ &= n_{i-1} \big[E\big((W^i)^2\big) E\big((h^{i-1})^2\big) - \big(E(W^i)\big)^2 \big(E(h^{i-1})\big)^2\big] \end{aligned}$$
Since $E(W^i) = 0$, this simplifies to:

$$\begin{aligned} Var(z^i) &= n_{i-1}\, E\big((W^i)^2\big) E\big((h^{i-1})^2\big) \\ &= n_{i-1} \big[E\big((W^i)^2\big) - \big(E(W^i)\big)^2\big] E\big((h^{i-1})^2\big) \\ &= n_{i-1}\, Var(W^i)\, E\big((h^{i-1})^2\big) \end{aligned}$$
Moreover, since $z^{i-1}$ has zero mean and a distribution symmetric about zero (its density satisfies $p(z) = p(-z)$), and ReLU zeroes the negative half-axis:

$$\begin{aligned} E\big((h^{i-1})^2\big) &= E\big(f^2(z^{i-1})\big) \\ &= \int_{-\infty}^{\infty} p(z^{i-1})\, f^2(z^{i-1})\, dz^{i-1} \\ &= 0 + \int_{0}^{\infty} p(z^{i-1})\, (z^{i-1})^2\, dz^{i-1} \\ &= \frac{1}{2} \int_{-\infty}^{\infty} p(z^{i-1})\, (z^{i-1})^2\, dz^{i-1} \\ &= \frac{1}{2} E\big((z^{i-1})^2\big) \\ &= \frac{1}{2} Var(z^{i-1}) \end{aligned}$$
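The key identity $E\big((h^{i-1})^2\big) = \frac{1}{2} Var(z^{i-1})$ is easy to verify by simulation for a zero-mean Gaussian pre-activation (an illustrative check, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 3.0, size=1_000_000)   # zero-mean, Var(z) = 9
h = np.maximum(z, 0.0)                     # ReLU
print(np.mean(h**2))                       # ≈ Var(z) / 2 = 4.5
```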
It follows that:

$$\begin{aligned} Var(z^i) &= \frac{1}{2} n_{i-1}\, Var(W^i)\, Var(z^{i-1}) \\ &= Var(z^1) \prod_{k = 2}^{i} \Big[\frac{1}{2} n_{k-1}\, Var(W^k)\Big] \end{aligned}$$
Since the variance of the state values must stay constant as the signal propagates, we require:
$$\forall i,\quad \frac{1}{2} n_{i-1}\, Var(W^i) = 1$$
2. Backward Pass
From the state-gradient formula, write $\frac{\partial L}{\partial z^i_k} = f'(z^i_k)\,\frac{\partial L}{\partial h^i_k}$, where $\frac{\partial L}{\partial h^i_k} = \big(\mathbf{W}^{i + 1}_{\cdot, k}\big)^T \frac{\partial L}{\partial \mathbf{z}^{i + 1}}$ is the activation gradient. As in the Xavier derivation:

$$Var\Big(\frac{\partial L}{\partial h^{i}}\Big) = Var\Big(\big(\mathbf{W}^{i + 1}_{\cdot, k}\big)^T \frac{\partial L}{\partial \mathbf{z}^{i + 1}}\Big) = n_{i+1}\, Var(W^{i+1})\, Var\Big(\frac{\partial L}{\partial z^{i+1}}\Big)$$
Moreover, for ReLU $f'(z^i)$ equals $0$ or $1$ with probability $\frac{1}{2}$ each (again by the symmetry of $z^i$ about zero), so $E\big((f'(z^i))^2\big) = \frac{1}{2}$; since $f'(z^i)$ is independent of $\frac{\partial L}{\partial h^i}$ and $E\big(\frac{\partial L}{\partial h^i}\big) = 0$, we get:

$$Var\Big(\frac{\partial L}{\partial z^{i}}\Big) = E\big((f'(z^i))^2\big)\, Var\Big(\frac{\partial L}{\partial h^{i}}\Big) = \frac{1}{2}\, Var\Big(\frac{\partial L}{\partial h^{i}}\Big)$$
Combining the two:

$$\begin{aligned} Var\Big(\frac{\partial L}{\partial z^i}\Big) &= \frac{1}{2} n_{i+1}\, Var(W^{i+1})\, Var\Big(\frac{\partial L}{\partial z^{i+1}}\Big) \\ &= Var\Big(\frac{\partial L}{\partial z^{d}}\Big) \prod_{k = i+1}^{d} \Big[\frac{1}{2} n_{k}\, Var(W^k)\Big] \end{aligned}$$
Since the variance of the state gradients must stay constant during propagation:
$$\forall i,\quad \frac{1}{2} n_{i}\, Var(W^i) = 1$$
This yields the following initialization schemes (stated with the backward condition; the forward condition gives $Var(W^i) = \frac{2}{n_{i-1}}$ instead, and He et al. note that satisfying either one alone is sufficient in practice):

- if $W$ follows a normal distribution, then $W \sim Normal\big(0, \frac{2}{n_{i}}\big)$;
- if $W$ follows a uniform distribution, then $W \sim Uniform\big(-\sqrt{\frac{6}{n_{i}}}, \sqrt{\frac{6}{n_{i}}}\big)$.
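A minimal numpy sketch of the normal variant, written in fan-in form ($Var = 2/n_{i-1}$; with the square layers used here fan-in and fan-out coincide). It checks that activation magnitudes neither vanish nor explode through a deep ReLU stack; the width, depth, and batch size are arbitrary illustrations:

```python
import numpy as np

def kaiming_normal(n_in, n_out, rng):
    """Fan-in variant of He initialization: W ~ Normal(0, 2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
n, depth, batch = 512, 30, 2000    # square layers: fan-in = fan-out
h = rng.normal(size=(n, batch))
for _ in range(depth):
    W = kaiming_normal(n, n, rng)
    h = np.maximum(W @ h, 0.0)     # ReLU
print(h.var())                     # stays O(1) even after 30 layers
```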