[Xavier] Understanding the difficulty of training deep feedforward neural networks

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]. International Conference on Artificial Intelligence and Statistics, 2010: 249-256.

@inproceedings{glorot2010understanding,
title={Understanding the difficulty of training deep feedforward neural networks},
author={Glorot, Xavier and Bengio, Yoshua},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={249--256},
year={2010}}

This paper proposes the Xavier parameter initialization method.

Main content

At layer $i = 1, \ldots, d$:
$$\mathbf{s}^i = \mathbf{z}^i W^i + \mathbf{b}^i, \qquad \mathbf{z}^{i+1} = f(\mathbf{s}^i),$$
where $\mathbf{z}^i$ is the input to layer $i$, $\mathbf{s}^i$ is the pre-activation vector, and $f(\cdot)$ is the activation function (assumed symmetric about $0$ with $f'(0)=1$, e.g. tanh).
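As a concrete reading of these two equations, here is a minimal NumPy sketch of the forward pass (the layer widths, the Gaussian weight scale, and the choice of tanh are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(z, weights, biases, f=np.tanh):
    """Layers i = 1..d: s^i = z^i W^i + b^i, z^{i+1} = f(s^i)."""
    activations = [z]
    for W, b in zip(weights, biases):
        s = z @ W + b      # pre-activation s^i
        z = f(s)           # input to the next layer, z^{i+1}
        activations.append(z)
    return activations

widths = [784, 256, 256, 10]   # n_0, ..., n_d (assumed)
weights = [rng.normal(0.0, 0.05, size=(m, n)) for m, n in zip(widths[:-1], widths[1:])]
biases  = [np.zeros(n) for n in widths[1:]]

x = rng.normal(size=(32, widths[0]))   # a batch of 32 inputs
zs = forward(x, weights, biases)
print([z.shape for z in zs])           # (32, 784), (32, 256), (32, 256), (32, 10)
```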


Around $0$ the approximation $f(s) \approx s$ holds (since $f'(0)=1$), so approximately
$$\mathrm{Var}(z^{i+1}) = n_i \mathrm{Var}(w^i z^i),$$
where $z^i, w^i$ denote individual elements of $\mathbf{z}^i, W^i$. Assume the $\{w^i\}$ are i.i.d., that $w^i$ and $z^i$ are independent of each other, and further that $\mathbb{E}(w^i)=0$ and $\mathbb{E}(x)=0$ (where $x$ is an input sample); then, again approximately around $0$,
$$\mathrm{Var}(z^{i+1}) = n_i \mathrm{Var}(w^i)\,\mathrm{Var}(z^i).$$


Applying this recursively down to the input gives
$$\mathrm{Var}(z^i) = \mathrm{Var}(x) \prod_{i'=0}^{i-1} n_{i'} \mathrm{Var}(w^{i'}),$$
where $n_i$ denotes the number of input units of layer $i$.
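A quick numerical check of this product formula (a sketch under the stated independence and zero-mean assumptions; the width $n$, depth $d$, Gaussian weights, and small input scale are chosen only to keep activations in the near-linear regime of tanh):

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 512, 10                  # width n_{i'} of every layer and depth (assumed)
var_w = 1.0 / n                 # try 0.5 / n or 2.0 / n to see the product shrink or grow

x = 0.01 * rng.normal(size=(10000, n))    # small inputs: tanh(s) ~= s holds
z = x
for _ in range(d):
    W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
    z = np.tanh(z @ W)

print("empirical  Var(z^d):", z.var())
print("predicted  Var(x) * (n Var(w))^d:", x.var() * (n * var_w) ** d)
```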

From gradient backpropagation we have
$$\frac{\partial \mathrm{Cost}}{\partial s_k^i} = f'(s_k^i)\, W_{k,\cdot}^{i+1} \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}}, \tag{2}$$
$$\frac{\partial \mathrm{Cost}}{\partial w_{l,k}^i} = z_l^i \frac{\partial \mathrm{Cost}}{\partial s_k^i}. \tag{3}$$

Hence
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s_k^i}\right] = \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1} \mathrm{Var}[w^{i'}], \tag{6}$$
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'} \mathrm{Var}[w^{i'}] \prod_{i'=i}^{d} n_{i'+1} \mathrm{Var}[w^{i'}] \times \mathrm{Var}(x)\, \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right].$$

Requiring the variances of the activations $z^i$ to be equal across layers in the forward pass gives
$$\forall i, \quad n_i \mathrm{Var}[w^i] = 1. \tag{10}$$
Requiring the variances of the back-propagated gradients $\frac{\partial \mathrm{Cost}}{\partial s^i}$ to be equal across layers gives
$$\forall i, \quad n_{i+1} \mathrm{Var}[w^i] = 1. \tag{11}$$
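Note that (10) and (11) can only hold simultaneously when consecutive layers have equal width, since
$$n_i \mathrm{Var}[w^i] = 1 \ \text{and} \ n_{i+1} \mathrm{Var}[w^i] = 1 \implies n_i = n_{i+1};$$
for layers of unequal width the two requirements conflict, which is what motivates the compromise below.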

The paper adopts a compromise between the two conditions,
$$\mathrm{Var}[w^i] = \frac{2}{n_{i+1}+n_i},$$
and realizes it with a uniform distribution from which $w^i$ is sampled: since $U[-a,a]$ has variance $a^2/3$, taking $a = \sqrt{6/(n_{i+1}+n_i)}$ gives
$$w^i \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{i+1}+n_i}},\ \frac{\sqrt{6}}{\sqrt{n_{i+1}+n_i}}\right].$$
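A minimal sketch of this initialization in NumPy (the function name and the `(fan_in, fan_out)` shape convention are mine, not the paper's; PyTorch ships an equivalent as `torch.nn.init.xavier_uniform_`):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix with Var[w] = 2 / (fan_in + fan_out)
    by drawing from U[-a, a] with a = sqrt(6 / (fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))   # the two values should be close
```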

The paper also contains extensive analysis of different activation functions (sigmoid, tanh, softsign, …); those parts are not the focus here, so they are not summarized.
