Vanishing and Exploding Gradients

Reference:

https://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010: 249-256.


Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by a transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden variable is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:

$$\mathbf{h}^{(l)} = f_l(\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \ldots \circ f_1(\mathbf{x}).$$

If all the hidden variables and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}^{(l)}$ as follows:

$$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{\mathbf{M}^{(L)} \stackrel{\mathrm{def}}{=}} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{\mathbf{M}^{(l+1)} \stackrel{\mathrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{\mathbf{v}^{(l)} \stackrel{\mathrm{def}}{=}}.$$

In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$ and the gradient vector $\mathbf{v}^{(l)}$.

Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities. Initially the matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product might be very large or very small.
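This instability is easy to see numerically. The sketch below (an illustrative stand-in, not the Jacobians of any specific network) multiplies 100 random $4 \times 4$ matrices: with entries of scale 1.0 the product's norm explodes, while with scale 0.1 it vanishes toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Multiply 100 random 4x4 matrices (stand-ins for the layer Jacobians
# M^(l)) and track the norm of the running product. A slightly "large"
# scale explodes; a slightly "small" scale vanishes.
norms = {}
for scale in (1.0, 0.1):
    prod = np.eye(4)
    for _ in range(100):
        prod = prod @ (scale * rng.standard_normal((4, 4)))
    norms[scale] = np.linalg.norm(prod)
    print(f"scale={scale}: |product| = {norms[scale]:.3e}")
```

Neither regime is usable for gradient-based learning, which is exactly the point of the analysis above.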

The risks posed by unstable gradients go beyond numerical representation. Gradients of unpredictable magnitude also threaten the stability of our optimization algorithms. We may be facing parameter updates that are either

  • excessively large, destroying our model (the exploding gradient problem); or
  • excessively small (the vanishing gradient problem), rendering learning impossible as parameters hardly move on each update.

Choice of Activation Function

One frequent culprit causing the vanishing gradient problem is the choice of the activation function $\sigma$ that is appended following each layer's linear operations. Historically, the sigmoid function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

was popular because it resembles a thresholding function. Let us take a closer look at the sigmoid to see why it can cause vanishing gradients.

$$\sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \sigma(x)\left[1 - \sigma(x)\right]$$
(Figure: plot of the sigmoid and its gradient; the gradient peaks near $x = 0$ and decays toward zero on both sides.)

As you can see, the sigmoid’s gradient vanishes both when its inputs are large and when they are small. Moreover, when backpropagating through many layers, unless we are in the Goldilocks zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall product may vanish. When our network boasts many layers, unless we are careful, the gradient will likely be cut off at some layer. Indeed, this problem used to plague deep network training.
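The size of these factors can be checked directly. A minimal numpy sketch of $\sigma'(x) = \sigma(x)[1-\sigma(x)]$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 at x = 0 and decays quickly away from it.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}: sigma'(x) = {sigmoid_grad(x):.2e}")
```

Even in the best case, backpropagating through 10 sigmoid layers multiplies roughly 10 such factors, and $0.25^{10} \approx 10^{-6}$; once the inputs saturate, the factors are far smaller still.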

Consequently, ReLUs, which are more stable, have emerged as the default choice for practitioners.

$$\mathrm{ReLU}(x) = \max(x, 0)$$
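The contrast with the sigmoid is that the ReLU's derivative is exactly 1 for any positive input, so those factors in the chain-rule product neither shrink nor grow. A small sketch:

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(x, 0): 1 for x > 0, 0 for x < 0
    # (undefined at exactly 0; 0 is used here by convention).
    return (x > 0).astype(float)

x = np.array([-5.0, -0.5, 0.5, 5.0])
print(relu_grad(x))  # [0. 0. 1. 1.]
```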


Xavier Initialization

If we write $\mathbf{z}^i$ for the activation vector of layer $i$, and $\mathbf{s}^i$ for the argument vector of the activation function at layer $i$, we have

$$\begin{aligned} \mathbf{s}^i &= \mathbf{z}^i \mathbf{W}^i + \mathbf{b}^i \\ \mathbf{z}^{i+1} &= f(\mathbf{s}^i) \end{aligned}$$
To keep information flowing (avoid gradients vanishing and exploding), we would like that

  • From a forward-propagation point of view
    $$\forall (i, i'), \quad \mathrm{Var}[z^i] = \mathrm{Var}[z^{i'}] \tag{XI.1}$$

  • From a back-propagation point of view
    $$\forall (i, i'), \quad \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^{i'}}\right] \tag{XI.2}$$

Assumptions:

  • The activation function operates in its linear regime, with $f'(s_k^i) \approx 1$ (so that $\mathbf{z}^{i+1} \approx \mathbf{s}^i$).

  • The input features all have the same variance, denoted $\mathrm{Var}[x]$.

Derivations:

Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$$\mathrm{Var}[z^i] = n_{i-1}\mathrm{Var}[W^{i-1}]\mathrm{Var}[z^{i-1}] = \cdots = \mathrm{Var}[x] \prod_{i'=0}^{i-1} n_{i'}\mathrm{Var}[W^{i'}] \tag{XI.3}$$

where $\mathrm{Var}[W^{i'}]$ denotes the shared scalar variance of all weights at layer $i'$.
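Equation (XI.3) can be checked numerically. The sketch below (with hypothetical widths and a weight variance chosen so each factor $n_{i'}\mathrm{Var}[W^{i'}]$ equals 1) propagates unit-variance inputs through linear layers and confirms the activation variance stays near 1:

```python
import numpy as np

rng = np.random.default_rng(42)

n = [512, 512, 512, 512]        # layer widths n_0 ... n_3 (assumed)
w_var = 1.0 / 512               # per-layer weight variance, n * Var[W] = 1
x = rng.standard_normal((10_000, n[0]))  # inputs with Var[x] = 1

z = x
for i in range(len(n) - 1):
    W = rng.normal(0.0, np.sqrt(w_var), size=(n[i], n[i + 1]))
    z = z @ W                   # linear activation: f(s) = s

# (XI.3) predicts Var[z^3] = Var[x] * prod(n_i' * Var[W^i']) = 1 * 1 = 1
print(f"empirical Var[z^3] = {z.var():.3f}")
```

Rescaling `w_var` up or down makes the variance grow or shrink geometrically with depth, which is the instability the constraint (XI.1) is designed to prevent.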

By using the chain rule, we can formulate the relationship between $\frac{\partial \mathrm{Cost}}{\partial s_k^i}$ and $\frac{\partial \mathrm{Cost}}{\partial s_k^{i+1}}$:

$$\frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^i} = \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}} \frac{\partial \mathbf{s}^{i+1}}{\partial \mathbf{z}^{i+1}} \frac{\partial \mathbf{z}^{i+1}}{\partial \mathbf{s}^i} = \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}} (\mathbf{W}^{i+1})^T f'(\mathbf{s}^i) \approx \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}} (\mathbf{W}^{i+1})^T$$

$$\frac{\partial \mathrm{Cost}}{\partial s_k^i} = W_{k,\cdot}^{i+1} \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}}$$

Then for a network with $d$ layers,

$$\mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1}\mathrm{Var}[W^{i'}] \tag{XI.4}$$

From (XI.3) and (XI.4), we can observe that (XI.1) and (XI.2) transform to

$$\forall i, \quad n_i \mathrm{Var}[W^i] = 1 \tag{XI.5}$$

$$\forall i, \quad n_{i+1} \mathrm{Var}[W^i] = 1 \tag{XI.6}$$

As a compromise between these two constraints, we might want to have

$$\forall i, \quad \mathrm{Var}[W^i] = \frac{2}{n_i + n_{i+1}} \tag{XI.7}$$
Based on this result, Glorot & Bengio proposed Xavier initialization:

Initialize the weights of each layer by drawing them from a distribution with zero mean and variance $2/(n_{in} + n_{out})$, where $n_{in}$ and $n_{out}$ are the numbers of inputs and outputs of that layer. The distribution used is typically Gaussian or uniform.
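A minimal numpy sketch of both variants (the function names are ours, not a standard API). For the uniform case, a distribution on $[-a, a]$ has variance $a^2/3$, so solving $a^2/3 = 2/(n_{in}+n_{out})$ gives the bound $a = \sqrt{6/(n_{in}+n_{out})}$:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng()):
    """Xavier/Glorot uniform init: Var[W] = 2 / (n_in + n_out).

    Uniform on [-a, a] has variance a**2 / 3, so a = sqrt(6 / (n_in + n_out)).
    """
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

def xavier_normal(n_in, n_out, rng=np.random.default_rng()):
    """Xavier/Glorot Gaussian init with the same target variance."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

# Example: a 784 -> 256 layer. The empirical variance should sit near
# the target 2 / (784 + 256).
W = xavier_uniform(784, 256)
print(f"target var = {2 / (784 + 256):.2e}, empirical var = {W.var():.2e}")
```

Deep learning frameworks ship this scheme built in (e.g. `torch.nn.init.xavier_uniform_` in PyTorch), so in practice you would rarely hand-roll it.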
