References:
https://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
Glorot X., Bengio Y. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010: 249-256.
Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by a transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden variable is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:

$$\mathbf{h}^{(l)} = f_l (\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \ldots \circ f_1(\mathbf{x}).$$
If all the hidden variables and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}^{(l)}$ as follows:

$$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{ \mathbf{M}^{(L)} \stackrel{\mathrm{def}}{=}} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{ \mathbf{M}^{(l+1)} \stackrel{\mathrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{ \mathbf{v}^{(l)} \stackrel{\mathrm{def}}{=}}.$$
In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$ and the gradient vector $\mathbf{v}^{(l)}$.
Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities. Initially, the matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product might be very large or very small.
The risks posed by unstable gradients go beyond numerical representation. Gradients of unpredictable magnitude also threaten the stability of our optimization algorithms. We may be facing parameter updates that are either
- excessively large, destroying our model (the exploding gradient problem); or
- excessively small (the vanishing gradient problem), rendering learning impossible as parameters hardly move on each update.
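Both failure modes can be seen directly by multiplying a chain of random matrices, a stand-in for the Jacobian product $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$. A minimal NumPy sketch (the layer count, dimension, and scales below are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def product_norm(scale, n_layers=50, dim=4):
    """Frobenius norm of a product of n_layers random dim x dim matrices,
    mimicking the layer-Jacobian product in backpropagation."""
    prod = np.eye(dim)
    for _ in range(n_layers):
        prod = prod @ (scale * rng.standard_normal((dim, dim)))
    return np.linalg.norm(prod)

print(product_norm(scale=0.1))  # shrinks toward zero: vanishing gradient
print(product_norm(scale=1.0))  # blows up: exploding gradient
```

Shrinking or growing each factor slightly compounds over 50 layers into astronomically small or large products.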
Choice of Activation Function
One frequent culprit causing the vanishing gradient problem is the choice of the activation function $\sigma$ that is appended following each layer's linear operations. Historically, the sigmoid function

$$\sigma(x) = 1/(1 + \exp(-x))$$

was popular because it resembles a thresholding function. Let us take a closer look at the sigmoid to see why it can cause vanishing gradients.
$$\sigma'(x) = (1+\exp(-x))^{-2}\exp(-x) = \sigma(x)[1-\sigma(x)]$$
As you can see, the sigmoid’s gradient vanishes both when its inputs are large and when they are small. Moreover, when backpropagating through many layers, unless we are in the Goldilocks zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall product may vanish. When our network boasts many layers, unless we are careful, the gradient will likely be cut off at some layer. Indeed, this problem used to plague deep network training.
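A few numbers make the vanishing concrete. The sketch below (pure Python, illustrative inputs) evaluates $\sigma'(x) = \sigma(x)[1-\sigma(x)]$, whose maximum is $0.25$ at $x = 0$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma(x) * (1 - sigma(x))

# The gradient peaks at 0.25 (at x = 0) and decays rapidly for large |x|.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))

# Even in the best case, backpropagating through 20 sigmoid layers scales
# the gradient by at most 0.25 per layer:
print(0.25 ** 20)  # ≈ 9.1e-13
```

So even when every input sits exactly in the Goldilocks zone, a 20-layer stack of sigmoids attenuates the gradient by roughly twelve orders of magnitude.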
Consequently, ReLUs, which are more stable, have emerged as the default choice for practitioners.
$$\mathrm{ReLU}(x) = \max(x, 0)$$
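A quick sketch of why the ReLU is more stable: its derivative is either 0 or 1, so gradients pass through active units unchanged rather than being attenuated at every layer (pure-Python illustration; the value at $x = 0$ is an implementation choice):

```python
def relu(x):
    return max(x, 0.0)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0; it is undefined at 0,
    # where implementations typically pick 0 or 1.
    return 1.0 if x > 0 else 0.0

print([relu_grad(x) for x in [-2.0, -0.5, 0.5, 3.0]])  # → [0.0, 0.0, 1.0, 1.0]
```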
Xavier Initialization
If we write $\mathbf z^{i}$ for the activation vector of layer $i$, and $\mathbf s^i$ for the argument vector of the activation function at layer $i$, we have

$$\begin{aligned} \mathbf s^i &= \mathbf z^i \mathbf W^i+\mathbf b^i\\ \mathbf z^{i+1}&=f(\mathbf s^i) \end{aligned}$$
To keep information flowing (avoiding both vanishing and exploding gradients), we would like:

- From a forward-propagation point of view:
$$\forall (i,i'), \quad Var[z^i]=Var[z^{i'}] \tag{XI.1}$$
- From a back-propagation point of view:
$$\forall (i,i'), \quad Var\left[\frac{\partial Cost}{\partial s^i}\right]=Var\left[\frac{\partial Cost}{\partial s^{i'}}\right] \tag{XI.2}$$
Assumptions:

- Use a linear activation with $f'(s_k^i)\approx 1$ (such that $\mathbf z^{i+1}\approx \mathbf s^{i}$).
- The input features all have the same variance ($= Var[x]$).
Derivations:
Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$$Var[z^i]=n_{i-1}Var[W^{i-1}]Var[z^{i-1}]=\cdots =Var[x]\prod_{i'=0}^{i-1}n_{i'}Var[W^{i'}] \tag{XI.3}$$
where $Var[W^{i'}]$ denotes the shared scalar variance of all weights at layer $i'$.
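The single-layer step of this derivation, $Var[z^{i+1}] = n_i \, Var[W^i] \, Var[z^i]$ under the linear-activation assumption, can be checked empirically. A NumPy sketch with illustrative layer sizes and weight variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# One linear layer with identity activation: s = z @ W.
n_in, n_out, batch = 512, 256, 10_000
w_var = 0.004  # illustrative shared weight variance

z = rng.normal(0.0, 1.0, size=(batch, n_in))            # Var[z^i] = 1
W = rng.normal(0.0, np.sqrt(w_var), size=(n_in, n_out))  # zero-mean weights
s = z @ W

predicted = n_in * w_var * 1.0  # n_i * Var[W^i] * Var[z^i] = 2.048
print(predicted, s.var())       # the two values should be close
```

Chaining such layers multiplies these factors, which is exactly the product in $(XI.3)$.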
By using the chain rule, we can formulate the relationship between $\frac{\partial Cost}{\partial s_k^{i}}$ and $\frac{\partial Cost}{\partial s_k^{i+1}}$:

$$\frac{\partial Cost}{\partial \mathbf s^{i}}=\frac{\partial Cost}{\partial \mathbf s^{i+1}}\frac{\partial \mathbf s^{i+1}}{\partial \mathbf z^{i+1}}\frac{\partial \mathbf z^{i+1}}{\partial \mathbf s^{i}}=\frac{\partial Cost}{\partial \mathbf s^{i+1}}(\mathbf W^{i+1})^Tf'(\mathbf s^i)\approx \frac{\partial Cost}{\partial \mathbf s^{i+1}}(\mathbf W^{i+1})^T$$

$$\frac{\partial Cost}{\partial s_k^{i}}=W_{k,\cdot}^{i+1} \frac{\partial Cost}{\partial \mathbf s^{i+1}}$$
Then, for a network with $d$ layers,

$$Var\left[\frac{\partial Cost}{\partial s^i}\right]=Var\left[\frac{\partial Cost}{\partial s^d}\right]\prod_{i'=i}^d n_{i'+1}Var[W^{i'}] \tag{XI.4}$$
From $(XI.3)$ and $(XI.4)$, we can observe that $(XI.1)$ and $(XI.2)$ transform to

$$\forall i,\quad n_i Var[W^i]=1 \tag{XI.5}$$

$$\forall i, \quad n_{i+1}Var[W^i]=1 \tag{XI.6}$$
As a compromise between these two constraints, we might want to have

$$\forall i, \quad Var[W^i]=\frac{2}{n_i+n_{i+1}}\tag{XI.7}$$
Based on this result, Glorot & Bengio proposed Xavier initialization: initialize the weights of each layer by drawing them from a distribution with zero mean and variance $2/(n_{in}+n_{out})$, where $n_{in}$ and $n_{out}$ are the numbers of inputs and outputs of that layer. The distribution used is typically Gaussian or uniform.
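This prescription can be sketched in a few lines of NumPy. Both variants below target the same variance $2/(n_{in}+n_{out})$; for the uniform case, $U(-a, a)$ has variance $a^2/3$, so $a = \sqrt{6/(n_{in}+n_{out})}$ (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    """Gaussian weights with zero mean and variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out):
    """Uniform variant: U(-a, a) has variance a^2 / 3, so pick
    a = sqrt(6 / (n_in + n_out)) to match the target variance."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_normal(400, 200)
print(W.var(), 2.0 / (400 + 200))  # empirical vs target variance
```

Deep learning frameworks ship equivalent initializers (e.g. `torch.nn.init.xavier_uniform_` in PyTorch), so in practice one rarely writes this by hand.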