Why Neural Networks Cannot Be Initialized with All Zeros

Anyone who has studied neural networks knows that the weights and biases should not all be initialized to zero, but not everyone knows why. Here we explain the reason with a concrete example and a short mathematical derivation.

Suppose the network to be learned has the structure shown below: a fully connected network with three inputs $x_1, x_2, x_3$, one hidden layer of two neurons, and a single output $\hat{y}$.

(Figure: the 3-2-1 fully connected network described above.)

Initialize the hidden-layer parameters as
$$
\boldsymbol{W}^{(1)}=
\begin{bmatrix}
w^{(1)}_{11} & w^{(1)}_{12} & w^{(1)}_{13}\\
w^{(1)}_{21} & w^{(1)}_{22} & w^{(1)}_{23}
\end{bmatrix}=
\begin{bmatrix}
0 & 0 & 0\\
0 & 0 & 0
\end{bmatrix},\qquad
\boldsymbol{b}^{(1)}=
\begin{bmatrix}
b^{(1)}_{1} & b^{(1)}_{2}
\end{bmatrix}^T=
\begin{bmatrix}
0 & 0
\end{bmatrix}^T
$$
Similarly, the output-layer parameters are
$$
\boldsymbol{W}^{(2)}=
\begin{bmatrix}
w^{(2)}_{11} & w^{(2)}_{12}
\end{bmatrix}=
\begin{bmatrix}
0 & 0
\end{bmatrix},\qquad
\boldsymbol{b}^{(2)}=
\begin{bmatrix}
b^{(2)}_{1}
\end{bmatrix}=
\begin{bmatrix}
0
\end{bmatrix}
$$
The hidden layer's pre-activation input and its output are
$$
\boldsymbol{z}^{(1)}=\boldsymbol{W}^{(1)}\boldsymbol{x}+\boldsymbol{b}^{(1)},\qquad
\boldsymbol{a}^{(1)}=f(\boldsymbol{z}^{(1)})
$$
where $f$ is the activation function. For convenience in the later derivation, write these in scalar form:
$$
\begin{aligned}
z^{(1)}_1&=w^{(1)}_{11}x_1+w^{(1)}_{12}x_2+w^{(1)}_{13}x_3+b^{(1)}_{1}\\
z^{(1)}_2&=w^{(1)}_{21}x_1+w^{(1)}_{22}x_2+w^{(1)}_{23}x_3+b^{(1)}_{2}\\
a^{(1)}_1&=f(z^{(1)}_1)\\
a^{(1)}_2&=f(z^{(1)}_2)
\end{aligned}
$$
The output of the output layer is
$$
\hat{y}=w^{(2)}_{11}a^{(1)}_1+w^{(2)}_{12}a^{(1)}_2+b^{(2)}_{1}
$$
Denote the loss function by $L(y,\boldsymbol{W},\boldsymbol{b})$.

Since the network is initialized to all zeros, in the first forward pass
$$
z^{(1)}_1=z^{(1)}_2=0,\qquad
a^{(1)}_1=a^{(1)}_2,\qquad
\hat{y}=0
$$
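This first forward pass can be checked numerically. Below is a minimal NumPy sketch of the 3-2-1 network above; the sigmoid activation and the particular input are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def f(z):
    """Sigmoid activation (an illustrative choice for f)."""
    return 1.0 / (1.0 + np.exp(-z))

# All-zero initialization, matching W^(1), b^(1), W^(2), b^(2) above
W1 = np.zeros((2, 3))            # hidden-layer weights, shape (2, 3)
b1 = np.zeros(2)                 # hidden-layer biases
W2 = np.zeros(2)                 # output-layer weights (row vector)
b2 = 0.0                         # output-layer bias

x = np.array([1.0, -2.0, 0.5])   # an arbitrary input vector

z1 = W1 @ x + b1                 # z^(1) = W^(1) x + b^(1)
a1 = f(z1)                       # a^(1) = f(z^(1))
y_hat = W2 @ a1 + b2             # y_hat = W^(2) a^(1) + b^(2)

print(z1)      # [0. 0.]    ->  z1 = z2 = 0
print(a1)      # [0.5 0.5]  ->  a1 = a2
print(y_hat)   # 0.0        ->  y_hat = 0
```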

For brevity, write
$$
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}=\sigma
$$
First backward pass:
$$
\begin{aligned}
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{11}}&=
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w^{(2)}_{11}}=
\sigma a^{(1)}_1\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{12}}&=
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w^{(2)}_{12}}=
\sigma a^{(1)}_2\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(2)}_{1}}&=
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial b^{(2)}_{1}}=
\sigma
\end{aligned}
$$
Since $a^{(1)}_1=a^{(1)}_2$, the partial derivatives of the loss with respect to $w^{(2)}_{11}$ and $w^{(2)}_{12}$ are identical, so after one update the two weights are still equal. Assume, without loss of generality, that they are nonzero after the update.

Taking partial derivatives with respect to the hidden-layer parameters:
$$
\begin{aligned}
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{11}}
&=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial \hat{y}}
\frac{\partial \hat{y}}{\partial a^{(1)}_{1}}
\frac{\partial a^{(1)}_{1}}{\partial z^{(1)}_1}
\frac{\partial z^{(1)}_1}{\partial w^{(1)}_{11}}
=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_1=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{12}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_2=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{13}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_3=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{1}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{21}}
&=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_1=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{22}}
&=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_2=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{23}}
&=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_3=0\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{2}}
&=\sigma w^{(2)}_{12}f'(z^{(1)}_2)=0
\end{aligned}
$$
These partial derivatives all vanish because $w^{(2)}_{11}=w^{(2)}_{12}=0$ when the gradient is computed, so the hidden-layer parameters are not updated in the first backward pass and remain zero.
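The zero hidden-layer gradients in the first backward pass can also be verified numerically. The self-contained sketch below assumes a sigmoid activation and a squared-error loss $L = \tfrac{1}{2}(\hat{y}-y)^2$; both are illustrative choices, and the conclusion does not depend on them.

```python
import numpy as np

def f(z):                                   # sigmoid activation (assumed)
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                             # its derivative f'(z)
    s = f(z)
    return s * (1.0 - s)

# All-zero initialization and an arbitrary training example
W1, b1 = np.zeros((2, 3)), np.zeros(2)
W2, b2 = np.zeros(2), 0.0
x, y = np.array([1.0, -2.0, 0.5]), 1.0

# First forward pass
z1 = W1 @ x + b1
a1 = f(z1)
y_hat = W2 @ a1 + b2

# First backward pass with L = 0.5 * (y_hat - y)^2, so sigma = y_hat - y
sigma = y_hat - y

dW2 = sigma * a1                     # equal entries: sigma*a1 == sigma*a2
db2 = sigma
delta1 = sigma * W2 * f_prime(z1)    # sigma * w^(2)_{1j} * f'(z^(1)_j)
dW1 = np.outer(delta1, x)            # zero, because W2 is still zero
db1 = delta1                         # zero as well

print(dW2)   # [-0.5 -0.5]  identical output-layer gradients
print(dW1)   # all zeros: the hidden layer receives no update
print(db1)   # all zeros
```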

In the second forward pass, since the hidden-layer parameters are still zero,
$$
z^{(1)}_1=z^{(1)}_2=0,\qquad
a^{(1)}_1=a^{(1)}_2
$$
Second backward pass:
$$
\begin{aligned}
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{11}}&=\sigma a^{(1)}_1\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(2)}_{12}}&=\sigma a^{(1)}_2\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(2)}_{1}}&=\sigma
\end{aligned}
$$
The partial derivatives with respect to $w^{(2)}_{11}$ and $w^{(2)}_{12}$ are again identical, so after the update the two weights remain equal.

For the hidden layer:
$$
\begin{aligned}
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{11}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_1
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{21}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_1\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{12}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_2
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{22}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_2\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{13}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)x_3
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial w^{(1)}_{23}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)x_3\\
\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{1}}
&=\sigma w^{(2)}_{11}f'(z^{(1)}_1)
=\frac{\partial L(y,\boldsymbol{W},\boldsymbol{b})}{\partial b^{(1)}_{2}}
=\sigma w^{(2)}_{12}f'(z^{(1)}_2)
\end{aligned}
$$
The pairwise equalities hold because $w^{(2)}_{11}=w^{(2)}_{12}$ after the first update and $z^{(1)}_1=z^{(1)}_2$, so $f'(z^{(1)}_1)=f'(z^{(1)}_2)$. Therefore, after this update
$$
\boldsymbol{w}^{(1)}_{1\cdot}=\boldsymbol{w}^{(1)}_{2\cdot},\qquad
b^{(1)}_1=b^{(1)}_2
$$
In other words, all hidden-layer neurons now share exactly the same parameters.

Predictably, since the hidden neurons now have identical parameters, in all subsequent forward passes
$$
z^{(1)}_1=z^{(1)}_2,\qquad
a^{(1)}_1=a^{(1)}_2
$$
and in every subsequent backward pass the two output-layer weights receive identical updates and the hidden-layer neurons keep identical parameters.

In other words, the hidden layer effectively acts as a single neuron no matter how many units it contains, which can directly render the model useless!
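This collapse can also be observed over an entire training run. The sketch below trains the same 3-2-1 network with plain full-batch gradient descent, under the same illustrative assumptions as the snippets above (sigmoid activation, squared-error loss, made-up data). Even after many updates, the two rows of $\boldsymbol{W}^{(1)}$ are still identical, so the hidden layer still behaves as a single neuron.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid activation (assumed)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # made-up inputs
Y = rng.normal(size=100)                   # made-up targets

W1, b1 = np.zeros((2, 3)), np.zeros(2)     # all-zero initialization
W2, b2 = np.zeros(2), 0.0
lr = 0.1

for _ in range(1000):                      # plain full-batch gradient descent
    Z1 = X @ W1.T + b1                     # (100, 2) hidden pre-activations
    A1 = f(Z1)                             # (100, 2) hidden activations
    Y_hat = A1 @ W2 + b2                   # (100,)   predictions

    sigma = (Y_hat - Y) / len(X)           # per-example dL/dy_hat for mean squared error
    dW2 = sigma @ A1
    db2 = sigma.sum()
    delta1 = np.outer(sigma, W2) * A1 * (1 - A1)   # hidden error signals, (100, 2)
    dW1 = delta1.T @ X
    db1 = delta1.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(W1)                                  # the two rows are identical
print(np.allclose(W1[0], W1[1]))           # True: both hidden neurons are the same
print(np.allclose(W2[0], W2[1]))           # True: the output weights stay equal too
```

Breaking the symmetry only requires a non-constant initialization of the weights, e.g. small random values; that single change makes the two rows of `W1` diverge during training.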
