References:
https://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
Glorot X., Bengio Y. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010: 249-256.
Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by a transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden variable is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:

$$\mathbf{h}^{(l)} = f_l (\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \ldots \circ f_1(\mathbf{x}).$$
If all the hidden variables and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}^{(l)}$ as follows:

$$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{ \mathbf{M}^{(L)} \stackrel{\mathrm{def}}{=}} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{ \mathbf{M}^{(l+1)} \stackrel{\mathrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{ \mathbf{v}^{(l)} \stackrel{\mathrm{def}}{=}}.$$
In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$ and the gradient vector $\mathbf{v}^{(l)}$.
Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities. Initially, the matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product might be very large or very small.
The risks posed by unstable gradients go beyond numerical representation. Gradients of unpredictable magnitude also threaten the stability of our optimization algorithms. We may be facing parameter updates that are either
- excessively large, destroying our model (the exploding gradient problem); or
- excessively small (the vanishing gradient problem), rendering learning impossible as parameters hardly move on each update.
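Both failure modes can be seen directly by multiplying a chain of random matrices, a stand-in for the Jacobian product $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$. A minimal NumPy sketch (the layer count, dimension, and scales below are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def product_norm(scale, n_layers=50, dim=4):
    """Frobenius norm of a product of n_layers random dim x dim matrices,
    mimicking the layer-Jacobian product in backpropagation."""
    prod = np.eye(dim)
    for _ in range(n_layers):
        prod = prod @ (scale * rng.standard_normal((dim, dim)))
    return np.linalg.norm(prod)

print(product_norm(scale=0.1))  # shrinks toward zero: vanishing gradient
print(product_norm(scale=1.0))  # blows up: exploding gradient
```

Shrinking or growing each factor slightly compounds over 50 layers into astronomically small or large products.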
Choice of Activation Function
One frequent culprit causing the vanishing gradient problem is the choice of the activation function $\sigma$ that is appended following each layer's linear operations. Historically, the sigmoid function

$$\sigma(x) = 1/(1 + \exp(-x))$$

was popular because it resembles a thresholding function. Let us take a closer look at the sigmoid to see why it can cause vanishing gradients.
$$\sigma'(x) = (1+\exp(-x))^{-2}\exp(-x) = \sigma(x)[1-\sigma(x)]$$
As you can see, the sigmoid’s gradient vanishes both when its inputs are large and when they are small. Moreover, when backpropagating through many layers, unless we are in the Goldilocks zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall product may vanish. When our network boasts many layers, unless we are careful, the gradient will likely be cut off at some layer. Indeed, this problem used to plague deep network training.
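A few numbers make the vanishing concrete. The sketch below (pure Python, illustrative inputs) evaluates $\sigma'(x) = \sigma(x)[1-\sigma(x)]$, whose maximum is $0.25$ at $x = 0$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma(x) * (1 - sigma(x))

# The gradient peaks at 0.25 (at x = 0) and decays rapidly for large |x|.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))

# Even in the best case, backpropagating through 20 sigmoid layers scales
# the gradient by at most 0.25 per layer:
print(0.25 ** 20)  # ≈ 9.1e-13
```

So even when every input sits exactly in the Goldilocks zone, a 20-layer stack of sigmoids attenuates the gradient by roughly twelve orders of magnitude.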
Consequently, ReLUs, which are more stable, have emerged as the default choice for practitioners.
$$\mathrm{ReLU}(x) = \max(x, 0)$$
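A quick sketch of why the ReLU is more stable: its derivative is either 0 or 1, so gradients pass through active units unchanged rather than being attenuated at every layer (pure-Python illustration; the value at $x = 0$ is an implementation choice):

```python
def relu(x):
    return max(x, 0.0)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0; it is undefined at 0,
    # where implementations typically pick 0 or 1.
    return 1.0 if x > 0 else 0.0

print([relu_grad(x) for x in [-2.0, -0.5, 0.5, 3.0]])  # → [0.0, 0.0, 1.0, 1.0]
```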
Xavier Initialization
If we write $\mathbf z^{i}$ for the activation vector of layer $i$, and $\mathbf s^i$ for the argument vector of the activation function at layer $i$, we have

$$\begin{aligned} \mathbf s^i &= \mathbf z^i \mathbf W^i+\mathbf b^i\\ \mathbf z^{i+1}&=f(\mathbf s^i) \end{aligned}$$
To keep information flowing (avoiding both vanishing and exploding gradients), we would like:

- From a forward-propagation point of view:
$$\forall (i,i'), \quad Var[z^i]=Var[z^{i'}] \tag{XI.1}$$
- From a back-propagation point of view:
$$\forall (i,i'), \quad Var\left[\frac{\partial Cost}{\partial s^i}\right]=Var\left[\frac{\partial Cost}{\partial s^{i'}}\right] \tag{XI.2}$$
Assumptions:

- Use a linear activation with $f'(s_k^i)\approx 1$ (such that $\mathbf z^{i+1}\approx \mathbf s^{i}$).
- The input features all have the same variance ($= Var[x]$).
Derivations:
Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$$Var[z^i]=n_{i-1}Var[W^{i-1}]Var[z^{i-1}]=\cdots =Var[x]\prod_{i'=0}^{i-1}n_{i'}Var[W^{i'}] \tag{XI.3}$$
where $Var[W^{i'}]$ denotes the shared scalar variance of all weights at layer $i'$.
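The single-layer step of this derivation, $Var[z^{i+1}] = n_i \, Var[W^i] \, Var[z^i]$ under the linear-activation assumption, can be checked empirically. A NumPy sketch with illustrative layer sizes and weight variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# One linear layer with identity activation: s = z @ W.
n_in, n_out, batch = 512, 256, 10_000
w_var = 0.004  # illustrative shared weight variance

z = rng.normal(0.0, 1.0, size=(batch, n_in))            # Var[z^i] = 1
W = rng.normal(0.0, np.sqrt(w_var), size=(n_in, n_out))  # zero-mean weights
s = z @ W

predicted = n_in * w_var * 1.0  # n_i * Var[W^i] * Var[z^i] = 2.048
print(predicted, s.var())       # the two values should be close
```

Chaining such layers multiplies these factors, which is exactly the product in $(XI.3)$.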
By using the chain rule, we can formulate the relationship between $\frac{\partial Cost}{\partial s_k^{i}}$ and $\frac{\partial Cost}{\partial s_k^{i+1}}$:

$$\frac{\partial Cost}{\partial \mathbf s^{i}}=\frac{\partial Cost}{\partial \mathbf s^{i+1}}\frac{\partial \mathbf s^{i+1}}{\partial \mathbf z^{i+1}}\frac{\partial \mathbf z^{i+1}}{\partial \mathbf s^{i}}=\frac{\partial Cost}{\partial \mathbf s^{i+1}}(\mathbf W^{i+1})^Tf'(\mathbf s^i)\approx \frac{\partial Cost}{\partial \mathbf s^{i+1}}(\mathbf W^{i+1})^T$$

$$\frac{\partial Cost}{\partial s_k^{i}}=W_{k,\cdot}^{i+1} \frac{\partial Cost}{\partial \mathbf s^{i+1}}$$
Then, for a network with $d$ layers,

$$Var\left[\frac{\partial Cost}{\partial s^i}\right]=Var\left[\frac{\partial Cost}{\partial s^d}\right]\prod_{i'=i}^d n_{i'+1}Var[W^{i'}] \tag{XI.4}$$
From $(XI.3)$ and $(XI.4)$, we can observe that $(XI.1)$ and $(XI.2)$ transform to

$$\forall i,\quad n_i Var[W^i]=1 \tag{XI.5}$$

$$\forall i, \quad n_{i+1}Var[W^i]=1 \tag{XI.6}$$
As a compromise between these two constraints, we might want to have

$$\forall i, \quad Var[W^i]=\frac{2}{n_i+n_{i+1}}\tag{XI.7}$$
Based on this result, Glorot & Bengio proposed Xavier initialization: initialize the weights of each layer by drawing them from a distribution with zero mean and variance $2/(n_{in}+n_{out})$, where $n_{in}$ and $n_{out}$ are the numbers of inputs and outputs of that layer. The distribution used is typically Gaussian or uniform.
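This prescription can be sketched in a few lines of NumPy. Both variants below target the same variance $2/(n_{in}+n_{out})$; for the uniform case, $U(-a, a)$ has variance $a^2/3$, so $a = \sqrt{6/(n_{in}+n_{out})}$ (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    """Gaussian weights with zero mean and variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out):
    """Uniform variant: U(-a, a) has variance a^2 / 3, so pick
    a = sqrt(6 / (n_in + n_out)) to match the target variance."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_normal(400, 200)
print(W.var(), 2.0 / (400 + 200))  # empirical vs target variance
```

Deep learning frameworks ship equivalent initializers (e.g. `torch.nn.init.xavier_uniform_` in PyTorch), so in practice one rarely writes this by hand.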