Xavier initialization - Paper Reading Notes


Paper: Understanding the difficulty of training deep feedforward neural networks

Single-layer network

Assume $y_i = \sum_{m=1}^{n_i} w_{mi}x_{mi} + b_i$, where the inputs and weights are zero-mean, independent and identically distributed.

Then we have

$Var(w_{mi}x_{mi}) = E[w_{mi}]^2 Var(x_{mi}) + E[x_{mi}]^2 Var(w_{mi}) + Var(w_{mi})Var(x_{mi}) = Var(w_{mi})Var(x_{mi})$

Since the inputs and weights are identically distributed, $Var(w_{mi}x_{mi}) = Var(w_{ki}x_{ki})\ \forall m, k \in [n_i]$.
So $Var(y_i) = n_i Var(w_i) Var(x_i)$.

To keep the output variance equal to the input variance, we need:

$Var(w_i) = \frac{1}{n_i}$
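As a quick numerical check (a minimal sketch; the fan-in of 512 and the sample count are arbitrary choices, not from the paper), drawing weights with variance $\frac{1}{n_i}$ does keep the output variance close to the input variance:

```python
import torch

torch.manual_seed(0)

n_i = 512            # fan-in (arbitrary width for this check)
num_samples = 10_000

# zero-mean, unit-variance inputs
x = torch.randn(num_samples, n_i)

# weights drawn with Var(w) = 1/n_i
w = torch.randn(n_i) / n_i ** 0.5

# y = sum_m w_m * x_m  (bias omitted; it does not affect the variance)
y = x @ w

print(f"Var(x) = {x.var().item():.3f}")  # ~1.0
print(f"Var(y) = {y.var().item():.3f}")  # ~1.0 = n_i * Var(w) * Var(x)
```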

Multi-layer network

For a multi-layer network, assume $s_i = z_i W_i + b_i$ and $z_{i+1} = f(s_i)$, with $d$ layers in total.

Forward (assuming the activations stay in their linear regime, so $f'(s) \approx 1$): $Var(z_i) = Var(x)\prod_{j=0}^{i-1} n_j Var(W_j)$

Backward: $Var(\frac{\partial Cost}{\partial f(s_i)}) = Var(\frac{\partial Cost}{\partial f(s_d)})\prod_{j=i}^{d} n_{j+1} Var(W_j)$
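Here is a minimal sketch of the forward recursion (assuming linear activations $f(s)=s$ and equal layer widths, both simplifications not fixed by the paper): with $Var(W_j) = \frac{1}{n_j}$ the activation variance stays roughly constant, while a larger scale makes it grow geometrically with depth.

```python
import torch

torch.manual_seed(0)

n, d, batch = 256, 10, 4096   # width, depth, batch size (arbitrary for this check)

def forward_variances(weight_std):
    z = torch.randn(batch, n)               # Var(z_0) = 1
    variances = []
    for _ in range(d):
        W = torch.randn(n, n) * weight_std  # Var(W) = weight_std ** 2
        z = z @ W                           # linear activation: z_{j+1} = z_j W_j
        variances.append(round(z.var().item(), 3))
    return variances

# Var(W) = 1/n keeps Var(z_i) roughly constant across layers
print(forward_variances((1 / n) ** 0.5))
# doubling the std makes Var(z_i) grow by ~4x per layer
print(forward_variances(2 * (1 / n) ** 0.5))
```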

Considering forward propagation, we would like to have $\forall i,j \in [d],\ Var(z_i) = Var(z_j)$

Considering backward propagation, we would like to have $\forall i,j \in [d],\ Var(\frac{\partial Cost}{\partial f(s_i)}) = Var(\frac{\partial Cost}{\partial f(s_j)})$

So we want to have

$\forall i,\ n_i Var(W_i) = 1$

$\forall i,\ n_{i+1} Var(W_i) = 1$

As a compromise between those two constraints, we have

$\forall i,\ Var(W_i) = \frac{2}{n_i + n_{i+1}}$

The variance of the uniform distribution on $[a,b]$ is $\frac{(b-a)^2}{12}$, so $W \sim U\left[-\sqrt{\frac{6}{n_i+n_{i+1}}},\ \sqrt{\frac{6}{n_i+n_{i+1}}}\right]$.
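A quick check (with arbitrary fan-in/fan-out of 300 and 200) that sampling from this uniform range gives a weight variance close to $\frac{2}{n_i + n_{i+1}}$:

```python
import torch

torch.manual_seed(0)

n_in, n_out = 300, 200                 # arbitrary fan-in / fan-out
bound = (6 / (n_in + n_out)) ** 0.5    # sqrt(6 / (n_i + n_{i+1}))

W = torch.empty(n_out, n_in).uniform_(-bound, bound)

print(f"empirical Var(W)        = {W.var().item():.5f}")
print(f"target 2/(n_in + n_out) = {2 / (n_in + n_out):.5f}")
```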

In PyTorch, this scheme is available as torch.nn.init.xavier_uniform_(); the recommended gain for a given nonlinearity can be computed with torch.nn.init.calculate_gain(), or we can just use the default gain of 1.
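For example (a minimal sketch; the layer sizes are placeholders):

```python
import torch.nn as nn

layer = nn.Linear(300, 200)

# default gain of 1 matches the derivation above (linear / tanh-like units)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

# or scale the bound by the gain recommended for a specific nonlinearity
gain = nn.init.calculate_gain('tanh')
nn.init.xavier_uniform_(layer.weight, gain=gain)
```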
