Why Neural Networks Cannot Be Initialized with All Zeros
Anyone who has studied neural networks knows that the weights and biases must not be initialized to all zeros, but not everyone knows why. Here we explain the reason through a concrete example and a step-by-step mathematical derivation.
Suppose the network to be trained is a small feed-forward network with three inputs $x_1, x_2, x_3$, one hidden layer of two neurons, and a single output $\hat{y}$.
Initialize the hidden-layer parameters to
$$
\boldsymbol{W}^{(1)}=
\left[\begin{matrix}
w^{(1)}_{11} & w^{(1)}_{12} & w^{(1)}_{13}\\
w^{(1)}_{21} & w^{(1)}_{22} & w^{(1)}_{23}
\end{matrix}\right]=
\left[\begin{matrix}
0 & 0 & 0\\
0 & 0 & 0
\end{matrix}\right],\qquad
\boldsymbol{b}^{(1)}=
\left[\begin{matrix} b^{(1)}_{1} & b^{(1)}_{2} \end{matrix}\right]^T=
\left[\begin{matrix} 0 & 0 \end{matrix}\right]^T
$$
Similarly, initialize the output-layer parameters to
$$
\boldsymbol{W}^{(2)}=
\left[\begin{matrix} w^{(2)}_{11} & w^{(2)}_{12} \end{matrix}\right]=
\left[\begin{matrix} 0 & 0 \end{matrix}\right],\qquad
\boldsymbol{b}^{(2)}=
\left[\begin{matrix} b^{(2)}_{1} \end{matrix}\right]^T=
\left[\begin{matrix} 0 \end{matrix}\right]^T
$$
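For concreteness, here is a minimal NumPy sketch of this all-zero initialization. The names `W1`, `b1`, `W2`, `b2` and the shapes follow the matrices above; NumPy itself is an illustrative choice, not part of the derivation:

```python
import numpy as np

# Hidden layer: 2 neurons, 3 inputs -> W1 is 2x3, b1 has 2 entries.
W1 = np.zeros((2, 3))   # every w^(1)_ij = 0
b1 = np.zeros(2)        # every b^(1)_i  = 0

# Output layer: 1 neuron, 2 hidden activations -> W2 is 1x2, b2 has 1 entry.
W2 = np.zeros((1, 2))   # every w^(2)_1j = 0
b2 = np.zeros(1)        # b^(2)_1 = 0
```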
The hidden layer's pre-activation input and its output are
$$
\boldsymbol{z}^{(1)}=\boldsymbol{W}^{(1)}\boldsymbol{x}+\boldsymbol{b}^{(1)},\qquad
\boldsymbol{a}^{(1)}=f(\boldsymbol{z}^{(1)})
$$
where $f$ is the activation function. For convenience in the calculations below, write this in scalar form:
$$
z^{(1)}_1=w^{(1)}_{11}x_1+w^{(1)}_{12}x_2+w^{(1)}_{13}x_3+b^{(1)}_{1}\\
z^{(1)}_2=w^{(1)}_{21}x_1+w^{(1)}_{22}x_2+w^{(1)}_{23}x_3+b^{(1)}_{2}\\
a^{(1)}_1=f(z^{(1)}_1)\\
a^{(1)}_2=f(z^{(1)}_2)
$$
The output of the network is
$$
\hat{y}=w^{(2)}_{11}a^{(1)}_1+w^{(2)}_{12}a^{(1)}_2+b^{(2)}_{1}
$$
Denote the loss function by $L(y,\boldsymbol{W},\boldsymbol{b})$.
Since the network is initialized to all zeros,
$$
z^{(1)}_1=z^{(1)}_2=0,\qquad
a^{(1)}_1=a^{(1)}_2,\qquad
\hat{y}=0
$$
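These three facts are easy to verify numerically. The following sketch runs one forward pass from the zero initialization, with sigmoid as an assumed choice of $f$ and an arbitrary made-up input vector:

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid, an arbitrary choice for f

# Zero-initialized parameters from above.
W1, b1 = np.zeros((2, 3)), np.zeros(2)
W2, b2 = np.zeros((1, 2)), np.zeros(1)

x = np.array([0.5, -1.2, 3.0])           # arbitrary input vector

z1 = W1 @ x + b1        # z^(1) = W^(1) x + b^(1)
a1 = f(z1)              # a^(1) = f(z^(1))
y_hat = W2 @ a1 + b2    # output-layer affine map

print(z1)      # [0. 0.]   -> z^(1)_1 = z^(1)_2 = 0
print(a1)      # [0.5 0.5] -> a^(1)_1 = a^(1)_2 = f(0)
print(y_hat)   # [0.]      -> y_hat = 0
```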
Let

$$
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}=\sigma
$$
In the first backpropagation pass:
$$
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{11}}=
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}
\frac{\partial \hat{y}}{\partial w^{(2)}_{11}}=\sigma a^{(1)}_1\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{12}}=
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}
\frac{\partial \hat{y}}{\partial w^{(2)}_{12}}=\sigma a^{(1)}_2\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(2)}_{1}}=
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}
\frac{\partial \hat{y}}{\partial b^{(2)}_{1}}=\sigma
$$
Clearly the loss has identical partial derivatives with respect to $w^{(2)}_{11}$ and $w^{(2)}_{12}$ (because $a^{(1)}_1=a^{(1)}_2$), so after one update the two weights are still equal. Without loss of generality, assume they are nonzero after the update.
Taking partial derivatives with respect to the hidden-layer parameters:
$$
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{11}}
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}
\frac{\partial \hat{y}}{\partial a^{(1)}_{1}}
\frac{\partial a^{(1)}_{1}}{\partial z^{(1)}_1}
\frac{\partial z^{(1)}_1}{\partial w^{(1)}_{11}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_1=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{12}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_2=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{13}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_3=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{1}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{21}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_1=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{22}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_2=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{23}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_3=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{2}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)=0
$$
Because these partial derivatives are all zero (the output-layer weights $w^{(2)}_{11}$ and $w^{(2)}_{12}$ are still zero during this pass), the hidden-layer parameters do not update in the first backpropagation pass; they remain zero.
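The gradients can be checked the same way. A minimal sketch, assuming a squared-error loss $L=\frac{1}{2}(\hat{y}-y)^2$ (so that $\sigma=\hat{y}-y$), sigmoid activation, and an arbitrary training pair:

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid (so f(0) = 0.5)
df = lambda z: f(z) * (1.0 - f(z))       # its derivative

# All-zero initialization.
W1, b1 = np.zeros((2, 3)), np.zeros(2)
W2, b2 = np.zeros((1, 2)), np.zeros(1)

x, y = np.array([0.5, -1.2, 3.0]), 1.0   # arbitrary training pair

# Forward pass.
z1 = W1 @ x + b1
a1 = f(z1)
y_hat = (W2 @ a1 + b2)[0]

# Backward pass, following the chain rule in the text.
sigma = y_hat - y                 # dL/dy_hat for L = 0.5 * (y_hat - y)^2

dW2 = sigma * a1                  # [sigma*a1_1, sigma*a1_2]
db2 = sigma
delta1 = sigma * W2[0] * df(z1)   # [sigma*w2_11*f'(z1_1), sigma*w2_12*f'(z1_2)]
dW1 = np.outer(delta1, x)
db1 = delta1

print(dW2)   # [-0.5 -0.5] -- equal entries, nonzero
print(db2)   # -1.0
print(dW1)   # all zeros: W2 is zero, so nothing reaches the hidden layer
print(db1)   # all zeros
```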
In the second forward pass, since the hidden-layer parameters are still zero,
$$
z^{(1)}_1=z^{(1)}_2=0,\qquad
a^{(1)}_1=a^{(1)}_2
$$
In the second backpropagation pass:
$$
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{11}}=\sigma a^{(1)}_1\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{12}}=\sigma a^{(1)}_2\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(2)}_{1}}=\sigma
$$
The partial derivatives with respect to $w^{(2)}_{11}$ and $w^{(2)}_{12}$ are again identical, so after the update the two weights remain equal.
For the hidden layer:
$$
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{11}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_1
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{21}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_1\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{12}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_2
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{22}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_2\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{13}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_3
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{23}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_3\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{1}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{2}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)
$$
(Each pair is equal because $w^{(2)}_{11}=w^{(2)}_{12}$ and $z^{(1)}_1=z^{(1)}_2$.) Therefore, after the update,
$$
\boldsymbol{w}^{(1)}_{1\cdot}=\boldsymbol{w}^{(1)}_{2\cdot},\qquad
b^{(1)}_1=b^{(1)}_2
$$
In other words, every hidden neuron now carries exactly the same parameters.
Predictably, since the hidden neurons share identical parameters, in every subsequent forward pass
$$
z^{(1)}_1=z^{(1)}_2,\qquad
a^{(1)}_1=a^{(1)}_2
$$
and in every subsequent backpropagation pass, the two output-layer weights again receive identical updates, as do the hidden neurons' parameters. The symmetry is never broken.
In other words, the hidden layer effectively collapses into a single neuron, which can render the model useless!
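To confirm that the symmetry never breaks, the sketch below runs plain gradient descent for 100 steps from the all-zero initialization (the sigmoid activation, squared-error loss, learning rate, and single fixed training pair are all illustrative assumptions) and asserts at every step that the two hidden neurons remain identical:

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid
df = lambda z: f(z) * (1.0 - f(z))       # its derivative

W1, b1 = np.zeros((2, 3)), np.zeros(2)   # all-zero initialization
W2, b2 = np.zeros((1, 2)), np.zeros(1)
x, y, lr = np.array([0.5, -1.2, 3.0]), 1.0, 0.1

for step in range(100):
    # Forward pass.
    z1 = W1 @ x + b1
    a1 = f(z1)
    y_hat = (W2 @ a1 + b2)[0]

    # Backward pass (squared-error loss: sigma = y_hat - y).
    sigma = y_hat - y
    dW2, db2 = sigma * a1, sigma
    delta1 = sigma * W2[0] * df(z1)
    dW1, db1 = np.outer(delta1, x), delta1

    # Gradient-descent update.
    W2 -= lr * dW2.reshape(1, -1); b2 -= lr * db2
    W1 -= lr * dW1;                b1 -= lr * db1

    # The two hidden neurons never differentiate from each other.
    assert np.allclose(W1[0], W1[1]) and np.isclose(b1[0], b1[1])
    assert np.isclose(W2[0, 0], W2[0, 1])

print(W1)   # both rows identical: the hidden layer acts as a single neuron
```

Swapping the zero initialization for small random values (e.g. `np.random.randn(2, 3) * 0.01`) breaks the tie from the very first step, which is exactly why weights are initialized randomly in practice.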