Xavier initialization - Paper Reading Notes


Paper: Understanding the difficulty of training deep feedforward neural networks

Single-layer network

Assume $y_i = \sum_{m=1}^{n_i} w_{mi}x_{mi} + b_i$, where the inputs and weights are zero-mean, independent and identically distributed.

Then we have

$Var(w_{mi}x_{mi}) = E[w_{mi}]^2 Var(x_{mi}) + E[x_{mi}]^2 Var(w_{mi}) + Var(w_{mi})Var(x_{mi}) = Var(w_{mi})Var(x_{mi})$

Since the inputs and weights are identically distributed, $Var(w_{mi}x_{mi}) = Var(w_{ki}x_{ki})\ \forall m, k \in [n_i]$.
So $Var(y_i) = n_i Var(w_i) Var(x_i)$.

To keep the output variance equal to the input variance, we need:

$Var(w_i) = \frac{1}{n_i}$
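As a quick numerical check (a minimal sketch; the fan-in of 512 and the sample count are arbitrary choices, not from the paper), drawing weights with variance $\frac{1}{n_i}$ does keep the output variance close to the input variance:

```python
import torch

torch.manual_seed(0)

n_i = 512            # fan-in (arbitrary width for this check)
num_samples = 10_000

# zero-mean, unit-variance inputs
x = torch.randn(num_samples, n_i)

# weights drawn with Var(w) = 1/n_i
w = torch.randn(n_i) / n_i ** 0.5

# y = sum_m w_m * x_m  (bias omitted; it does not affect the variance)
y = x @ w

print(f"Var(x) = {x.var().item():.3f}")  # ~1.0
print(f"Var(y) = {y.var().item():.3f}")  # ~1.0 = n_i * Var(w) * Var(x)
```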

Multi-layer network

For a multi-layer network, assume $s_i = z_i W_i + b_i$ and $z_{i+1} = f(s_i)$, with $d$ layers in total.

Forward (assuming the activations stay in their linear regime, so $f'(s) \approx 1$): $Var(z_i) = Var(x)\prod_{j=0}^{i-1} n_j Var(W_j)$

Backward: $Var(\frac{\partial Cost}{\partial f(s_i)}) = Var(\frac{\partial Cost}{\partial f(s_d)})\prod_{j=i}^{d} n_{j+1} Var(W_j)$
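Here is a minimal sketch of the forward recursion (assuming linear activations $f(s)=s$ and equal layer widths, both simplifications not fixed by the paper): with $Var(W_j) = \frac{1}{n_j}$ the activation variance stays roughly constant, while a larger scale makes it grow geometrically with depth.

```python
import torch

torch.manual_seed(0)

n, d, batch = 256, 10, 4096   # width, depth, batch size (arbitrary for this check)

def forward_variances(weight_std):
    z = torch.randn(batch, n)               # Var(z_0) = 1
    variances = []
    for _ in range(d):
        W = torch.randn(n, n) * weight_std  # Var(W) = weight_std ** 2
        z = z @ W                           # linear activation: z_{j+1} = z_j W_j
        variances.append(round(z.var().item(), 3))
    return variances

# Var(W) = 1/n keeps Var(z_i) roughly constant across layers
print(forward_variances((1 / n) ** 0.5))
# doubling the std makes Var(z_i) grow by ~4x per layer
print(forward_variances(2 * (1 / n) ** 0.5))
```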

Considering forward propagation, we would like to have $\forall i,j \in [d],\ Var(z_i) = Var(z_j)$

Considering backward propagation, we would like to have $\forall i,j \in [d],\ Var(\frac{\partial Cost}{\partial f(s_i)}) = Var(\frac{\partial Cost}{\partial f(s_j)})$

So we want to have

$\forall i,\ n_i Var(W_i) = 1$

$\forall i,\ n_{i+1} Var(W_i) = 1$

As a compromise between those two constraints, we have

$\forall i,\ Var(W_i) = \frac{2}{n_i + n_{i+1}}$

The variance of the uniform distribution on $[a,b]$ is $\frac{(b-a)^2}{12}$, so $W \sim U\left[-\sqrt{\frac{6}{n_i+n_{i+1}}},\ \sqrt{\frac{6}{n_i+n_{i+1}}}\right]$.
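A quick check (with arbitrary fan-in/fan-out of 300 and 200) that sampling from this uniform range gives a weight variance close to $\frac{2}{n_i + n_{i+1}}$:

```python
import torch

torch.manual_seed(0)

n_in, n_out = 300, 200                 # arbitrary fan-in / fan-out
bound = (6 / (n_in + n_out)) ** 0.5    # sqrt(6 / (n_i + n_{i+1}))

W = torch.empty(n_out, n_in).uniform_(-bound, bound)

print(f"empirical Var(W)        = {W.var().item():.5f}")
print(f"target 2/(n_in + n_out) = {2 / (n_in + n_out):.5f}")
```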

In PyTorch, this scheme is available as torch.nn.init.xavier_uniform_(); the recommended gain for a given nonlinearity can be computed with torch.nn.init.calculate_gain(), or we can just use the default gain of 1.
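For example (a minimal sketch; the layer sizes are placeholders):

```python
import torch.nn as nn

layer = nn.Linear(300, 200)

# default gain of 1 matches the derivation above (linear / tanh-like units)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

# or scale the bound by the gain recommended for a specific nonlinearity
gain = nn.init.calculate_gain('tanh')
nn.init.xavier_uniform_(layer.weight, gain=gain)
```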
