A Detailed Derivation of the LSTM Equations

morris_mao

Categories: Deep Learning, Machine Learning. Tags: LSTM, RNN, backpropagation, gradient descent, deep learning

Article link: https://blog.csdn.net/u010754290/article/details/47167979

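Introduction

Alex Graves's "Supervised Sequence Labelling with Recurrent Neural Networks" gives a survey-style introduction to LSTM and derives the equations for its forward pass and backward pass.

This article walks through the forward and backward derivations step by step, with simpler figures and formulas, so that readers come away with a deeper understanding of LSTM.

Readers who are unsure where LSTM comes from or how it works may want to start with DarkScope's blog post "RNN以及LSTM的介绍和公式梳理" (an introduction to RNNs and LSTMs with a walkthrough of the formulas).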

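I. The Basic Structure of an LSTM

In an LSTM, the hidden layer at each time step contains several memory blocks (in practice a single block is typical), each block contains several memory cells, and each memory cell consists of one Cell and three gates. A basic example of the structure is shown in the figure below:

A single memory cell produces only a scalar value; a block produces a vector.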

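II. The LSTM Forward Pass

1. Introduction

First we unroll the basic LSTM structure above over time, so that the recurrent structure is easier to see:

Overall structure of the LSTM

We adopt two conventions here:

The hidden layer at each time step contains one block
Each block contains one memory cell

The forward pass below starts from the Input and computes, in order, the Input Gate, the Forget Gate, the Cell, the Output Gate, and the final Output.

One point to state up front: the derivation strictly follows the LSTM structure in the figures above. The paper's formulas contain some additional terms compared with this article's derivation; wherever a formula differs, I point it out.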

2. Computing the Input Gate ( $\iota$ )

The Input Gate takes two inputs:

The Input at the current time step: $x^t$
All Cells in the same block at the previous time step: $s_c^{t-1}$

In this example each layer has a single block with a single memory cell, so the sum $\sum_{c=1}^{C}$ can be ignored; the same applies to the Forget Gate and Output Gate below.

Input Gate

The Input Gate's output is then:

$a_\iota^t = \sum_{i=1}^{I} \omega_{i\iota} x_i^t + \sum_{c=1}^{C} \omega_{c\iota} s_c^{t-1}$

$b_\iota^t = f(a_\iota^t)$

The Input Gate can also take the outputs $b_h^{t-1}$ of other blocks at the previous time step as input; in the paper, $a_\iota^t$ has the additional term $\sum_{h=1}^{H} \omega_{h\iota} b_h^{t-1}$ .
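As a rough illustration of the two Input Gate equations, the sketch below evaluates $a_\iota^t$ and $b_\iota^t$ for the single-cell case. It assumes the gate activation $f$ is the logistic sigmoid, which the derivation itself does not fix; the function name `input_gate` and the numeric values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, assumed here as the gate activation f."""
    return 1.0 / (1.0 + np.exp(-a))

def input_gate(x_t, s_prev, w_x, w_c):
    """Single-cell Input Gate: a = sum_i w_i * x_i + w_c * s_prev, b = f(a)."""
    a = float(np.dot(w_x, x_t) + w_c * s_prev)
    return a, sigmoid(a)

# Hypothetical values: three input features and one previous cell state.
x_t = np.array([0.5, -1.0, 2.0])
w_x = np.array([0.1, 0.2, 0.3])   # weights omega_{i,iota}
a_iota, b_iota = input_gate(x_t, s_prev=0.4, w_x=w_x, w_c=0.5)
```

The Forget Gate follows the same pattern with its own weights, so it is not repeated separately.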

3. Computing the Forget Gate ( $\phi$ )

The Forget Gate takes two inputs:

The Input at the current time step: $x^t$
All Cells in the same block at the previous time step: $s_c^{t-1}$

Forget Gate

The Forget Gate's output is then:

$a_\phi^t = \sum_{i=1}^{I} \omega_{i\phi} x_i^t + \sum_{c=1}^{C} \omega_{c\phi} s_c^{t-1}$

$b_\phi^t = f(a_\phi^t)$

The Forget Gate can also take the outputs $b_h^{t-1}$ of other blocks at the previous time step as input; in the paper, $a_\phi^t$ has the additional term $\sum_{h=1}^{H} \omega_{h\phi} b_h^{t-1}$ .

4. Computing the Cell ( $c$ )

Computing the Cell is slightly more involved. It takes two inputs:

The product of the Input Gate and the Input
The product of the Forget Gate and the corresponding Cell state at the previous time step

Cell

The Cell state is then:

$a_c^t = \sum_{i=1}^{I} \omega_{ic} x_i^t$

$s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t)$

The Cell input can also take the outputs $b_h^{t-1}$ of other blocks at the previous time step as input; in the paper, $a_c^t$ has the additional term $\sum_{h=1}^{H} \omega_{hc} b_h^{t-1}$ .
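To make the role of the two gates in the state update concrete, here is a minimal sketch of $s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t)$. It assumes $g$ is `tanh`, which is an assumption rather than something the derivation requires; `cell_state` and the sample values are hypothetical.

```python
import numpy as np

def cell_state(x_t, s_prev, b_phi, b_iota, w_xc, g=np.tanh):
    """Cell update: s_t = b_phi * s_prev + b_iota * g(a_c), with a_c = sum_i w_ic * x_i."""
    a_c = float(np.dot(w_xc, x_t))
    return b_phi * s_prev + b_iota * float(g(a_c))

x_t = np.array([0.5, -1.0, 2.0])
w_xc = np.array([0.2, 0.1, 0.4])   # a_c = 0.1 - 0.1 + 0.8 = 0.8

# Forget gate fully open, input gate closed: the old state is carried over.
kept = cell_state(x_t, 0.7, b_phi=1.0, b_iota=0.0, w_xc=w_xc)
# Forget gate closed, input gate fully open: the state becomes g(a_c).
replaced = cell_state(x_t, 0.7, b_phi=0.0, b_iota=1.0, w_xc=w_xc)
```

The two extreme settings show why the gates are described as controlling what is remembered and what is written.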

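5. Computing the Output Gate ( $\omega$ )

The Output Gate takes two inputs:

The Input at the current time step: $x^t$
All Cells in the same block at the current time step: $s_c^t$

The Output Gate takes the Cell's value at the current time step rather than at the previous one because the Cell's result is already available at this point; to control the Output Gate we can use the Cell's current result directly, with no need to go back to the previous time step.

Output Gate

The Output Gate's output is then:

$a_\omega^t = \sum_{i=1}^{I} \omega_{i\omega} x_i^t + \sum_{c=1}^{C} \omega_{c\omega} s_c^t$

$b_\omega^t = f(a_\omega^t)$

The Output Gate can also take the outputs $b_h^{t-1}$ of other blocks at the previous time step as input; in the paper, $a_\omega^t$ has the additional term $\sum_{h=1}^{H} \omega_{h\omega} b_h^{t-1}$ , where $h$ indexes the hidden-layer cell outputs at time $t-1$.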

6. Computing the Cell Output ( $c$ )

The Cell Output is simply the product of the Output Gate and the Cell state passed through $h$.

The Cell Output is then:

$b_c^t = b_\omega^t h(s_c^t)$

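7. Summary

This completes the forward pass of the whole block, from Input to Output. It involves the three gates and the intermediate Cell. Note that the three gates use the activation function $f$ , whereas the Input uses the activation function $g$ and the Cell output uses the activation function $h$ .

Note also that, in the general computation, the three gates at the current time step can receive input from the units of the previous time step. This appears in the paper's formulas, but the corresponding edges are not drawn in the figures here. For this article we can assume that only the previous time step's Cell is connected to the current time step's Cell and three gates.
Forward pass summary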
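The forward pass summarized above can be sketched as one function per time step for the single-block, single-cell case. This is only an illustration under stated assumptions: $f$ is taken to be the logistic sigmoid and $g$, $h$ to be `tanh`, the initial cell state is set to 0, and the parameter names (`w_xi`, `w_ci`, and so on) are hypothetical labels for $\omega_{i\iota}$, $\omega_{c\iota}$, etc.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, s_prev, p):
    """One forward step of a single-cell block, following the equations above."""
    b_iota = sigmoid(p["w_xi"] @ x_t + p["w_ci"] * s_prev)    # Input Gate
    b_phi = sigmoid(p["w_xf"] @ x_t + p["w_cf"] * s_prev)     # Forget Gate
    a_c = p["w_xc"] @ x_t                                     # Cell input
    s_t = b_phi * s_prev + b_iota * np.tanh(a_c)              # Cell state
    b_omega = sigmoid(p["w_xo"] @ x_t + p["w_co"] * s_t)      # Output Gate reads s_t
    b_c = b_omega * np.tanh(s_t)                              # Cell Output
    return s_t, b_c

rng = np.random.default_rng(0)
num_inputs = 3
# Hypothetical random input weights and fixed cell-to-gate weights.
p = {name: rng.normal(scale=0.5, size=num_inputs)
     for name in ("w_xi", "w_xf", "w_xc", "w_xo")}
p.update({name: 0.1 for name in ("w_ci", "w_cf", "w_co")})

xs = rng.normal(size=(5, num_inputs))   # a sequence of 5 time steps
s = 0.0                                 # assumed initial cell state
outputs = []
for x_t in xs:
    s, b_c = lstm_step(x_t, s, p)
    outputs.append(float(b_c))
```

Because the Cell Output is a sigmoid value times a `tanh` value, every entry of `outputs` lies strictly between -1 and 1 under these assumptions.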

III. The LSTM Backward Pass

1. Introduction

The paper uses the term "Backward Pass", but this is simply backpropagation: the chain rule is used to obtain the gradient of every weight in the LSTM.

2. Choosing the Loss Function

For generality, we only show the loss function for multi-class classification. For the network's final output we use the $softmax$ function to compute the probability that the result belongs to each class (the probabilities over the $K$ classes sum to 1).

$p(C_k|x) = y_k = \frac{e^{a_k}}{\sum_{k'=1}^{K} e^{a_{k'}}}$

Note that the partial derivative of $y_{k'}$ with respect to $a_k$ is $\frac{\partial y_{k'}}{\partial a_k}=y_k\delta_{kk'} - y_ky_{k'}$ ( $\delta_{kk'}$ is 1 when $k = k'$ and 0 otherwise).

Here the network outputs $a_1, a_2,...$ give $p(C_1|x), p(C_2|x),...$ , i.e. the probabilities that the output class is $C_1, C_2,...$ given the input $x$ .

The loss function is then easy to define: for $k\in\{1,2,...,K\}$ , the network assigns probability $y_k$ to class $k$ , and the true value is $z_k$ :

$\mathcal{L}(x, z) = -\ln p(z|x) = -\sum_{k=1}^{K} z_k \ln y_k$
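The softmax and the loss above can be evaluated directly; the sketch below does so for a hypothetical three-class output with a one-hot target. Subtracting the maximum before exponentiating is a common numerical-stability step that is added here as an assumption; it does not change the value of $y_k$.

```python
import numpy as np

def softmax(a):
    """y_k = exp(a_k) / sum_k' exp(a_k'), shifted by max(a) for stability."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(y, z):
    """L(x, z) = -sum_k z_k * ln(y_k)."""
    return -float(np.sum(z * np.log(y)))

a = np.array([2.0, 1.0, 0.1])   # hypothetical network outputs a_k
z = np.array([1.0, 0.0, 0.0])   # one-hot target: the true class is C_1
y = softmax(a)
loss = cross_entropy(y, z)
```

With a one-hot target the sum collapses to a single term, so `loss` equals `-ln(y[0])` in this example.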

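3. Updating the Weights

For every weight in the network we need the corresponding gradient, so that repeated stochastic gradient descent over the training samples can drive the loss toward a minimum (for a non-convex network this need not be the global optimum). First, we need to know which weights have to be updated.

In a cleanly layered neural network with input, hidden, and output layers, the weights between layers are easy to see. In an LSTM, however, the weights can only be identified from the formulas, and they do not correspond one-to-one with the edges in the figures. Below, I mark on the figure the weights in a single LSTM block that need updating:

For convenience, note that we only consider the case where the previous time step's Cell is connected to the current time step's Cell and three gates.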
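For reference, a gradient descent update simply moves each weight against its gradient. The sketch below is a generic illustration, not the article's training procedure: the helper `sgd_step`, the learning rate, and the toy loss $(w-3)^2$ are all hypothetical stand-ins for the LSTM weights and loss.

```python
def sgd_step(weights, grads, lr=0.1):
    """Return new weights w - lr * dL/dw for every named weight."""
    return {name: w - lr * grads[name] for name, w in weights.items()}

# Toy problem standing in for the LSTM loss: L(w) = (w - 3)^2, dL/dw = 2 * (w - 3).
weights = {"w": 0.0}
for _ in range(100):
    grads = {"w": 2.0 * (weights["w"] - 3.0)}
    weights = sgd_step(weights, grads)
```

In the LSTM the same update is applied to every input weight and cell-to-gate weight once the gradients derived in the following sections are available.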

4. Gradient of the Cell Output

First we compute the gradient for each output class:

$\begin{aligned} \delta_k^t &= \frac{\partial \mathcal{L}(x,z)}{\partial a_k^t}\\ &= \frac{\partial (-\sum_{k'=1}^{K} z_{k'}\ln y_{k'})}{\partial a_k^t}\\ &= -\sum_{k'=1}^{K} z_{k'} \frac{\partial \ln y_{k'}}{\partial a_k^t}\\ &= -\sum_{k'=1}^{K} \frac{z_{k'}}{y_{k'}} \frac{\partial y_{k'}}{\partial a_k^t}\\ &= -\sum_{k'=1}^{K} \frac{z_{k'}}{y_{k'}} (y_k\delta_{kk'} - y_ky_{k'})\\ &= -\sum_{k'=1}^{K} \frac{z_{k'}}{y_{k'}} y_k\delta_{kk'} + \sum_{k'=1}^{K} \frac{z_{k'}}{y_{k'}} y_ky_{k'}\\ &= -z_k + y_k\sum_{k'=1}^K z_{k'}\\ &= y_k - z_k \end{aligned}$

The last step uses $\sum_{k'=1}^{K} z_{k'} = 1$ , which holds because the target $z$ is a probability distribution over the classes (for example a one-hot vector).

That is, the gradient for each output class depends only on its predicted value and its true value. The gradient of the Cell Output then follows from the chain rule:

$\epsilon_c^t = \frac{\partial \mathcal{L}(x,z)}{\partial b_c^t} = \sum_{k=1}^{K}\frac{\partial \mathcal{L}(x,z)}{\partial a_k^t} \frac{\partial a_k^t}{\partial b_c^t} = \sum_{k=1}^{K} \delta_k^t \omega_{ck}$

Because the Output can also connect to the Cell and three gates of the next time step, their gradients can propagate back to the current Output. The paper therefore has the additional term $\sum_{g=1}^G\omega_{cg}\delta_g^{t+1}$ ; for simplicity it is omitted from the formulas and figures here.

Cell Output
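The result $\delta_k^t = y_k - z_k$ is easy to sanity-check numerically. The sketch below compares it against a central finite difference of the loss and then forms $\epsilon_c^t = \sum_k \delta_k^t \omega_{ck}$ . The output weights `w_ck` and the sample values are hypothetical, and the output layer is assumed to see only the single Cell Output.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def loss_from_outputs(a, z):
    """Cross-entropy loss as a function of the pre-softmax outputs a_k."""
    return -float(np.sum(z * np.log(softmax(a))))

a = np.array([0.3, -1.2, 0.8])   # hypothetical outputs a_k
z = np.array([0.0, 0.0, 1.0])    # one-hot target, so sum_k z_k = 1

delta = softmax(a) - z           # analytic gradient: delta_k = y_k - z_k

# Central finite differences of the loss with respect to each a_k.
h = 1e-6
numeric = np.zeros_like(a)
for k in range(a.size):
    step = np.zeros_like(a)
    step[k] = h
    numeric[k] = (loss_from_outputs(a + step, z)
                  - loss_from_outputs(a - step, z)) / (2 * h)

# Cell Output gradient: epsilon_c = sum_k delta_k * w_ck (hypothetical weights).
w_ck = np.array([0.5, -0.4, 0.2])
epsilon_c = float(np.dot(delta, w_ck))
```

The entries of `delta` sum to zero because both $y$ and $z$ sum to 1.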

5. Gradient of the Output Gate

By the chain rule, the Output Gate gradient is derived as follows:

$\delta_\omega^t = \frac{\partial \mathcal{L}(x,z)}{\partial a_\omega^t} = \frac{\partial \mathcal{L}(x,z)}{\partial b_c^t} \frac{\partial b_c^t}{\partial b_\omega^t} \frac{\partial b_\omega^t}{\partial a_\omega^t}=\epsilon_c^t h(s_c^t)f'(a_\omega^t)$

Also, since a single block may contain several memory cells sharing one Forget Gate, one Input Gate, and one Output Gate, the paper writes the Output Gate gradient as $f'(a_\omega^t) \sum_{c=1}^{C} \epsilon_c^t h(s_c^t)$ ; the derivation is the same. The figure below shows how the gradients are accumulated at the single gate:

Output Gate
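A small sketch of the multi-cell form $f'(a_\omega^t) \sum_{c} \epsilon_c^t h(s_c^t)$ follows. It assumes $f$ is the logistic sigmoid, so that $f'(a_\omega^t) = b_\omega^t (1 - b_\omega^t)$ , and that $h$ is `tanh`; the function name and the sample numbers are hypothetical.

```python
import numpy as np

def output_gate_delta(eps_c, s_c, b_omega):
    """delta_omega = f'(a_omega) * sum_c eps_c * h(s_c), with sigmoid f and tanh h."""
    eps_c = np.asarray(eps_c, dtype=float)
    s_c = np.asarray(s_c, dtype=float)
    f_prime = b_omega * (1.0 - b_omega)   # sigmoid derivative written via its output
    return f_prime * float(np.sum(eps_c * np.tanh(s_c)))

# One cell: reduces to eps_c * h(s_c) * f'(a_omega).
single = output_gate_delta([0.3], [0.5], b_omega=0.6)
# Two cells sharing the gate: their contributions are summed at the gate.
multi = output_gate_delta([0.3, -0.1], [0.5, 1.2], b_omega=0.6)
```

The Forget Gate and Input Gate gradients accumulate over cells in the same way.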

6. Gradient of the Cell

Attentive readers will notice that the Cell's computation is structured differently from a unit in an ordinary neural network. Let us first recall the forward computation of the Cell:

$a_c^t = \sum_{i=1}^{I} \omega_{ic} x_i^t$

$s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t)$

The input data contributes to $a_c^t$ , while the Cell also receives input from the Input Gate and the Forget Gate.

The gradient therefore passes straight down from the Cell:

$\delta_c^t = \frac{\partial \mathcal{L}(x,z)}{\partial a_c^t} = \frac{\partial \mathcal{L}(x,z)}{\partial s_c^t} \frac{\partial s_c^t}{\partial a_c^t} =\frac{\partial \mathcal{L}(x,z)}{\partial s_c^t} b_\iota^tg'(a_c^t)$

Here we define the States gradient $\epsilon_s^t$ . The gradient of the Cell state can be propagated back from the following units:

The Cell Output at the current time step (together with the current Output Gate, which reads $s_c^t$ )
The Cell at the next time step
The Input Gate at the next time step
The Forget Gate at the next time step

The States gradient can then be computed as follows; the four sources of gradient listed above correspond one-to-one with the four terms in the second line of the formula:

$\begin{aligned} \epsilon_s^t &= \frac{\partial \mathcal{L}(x,z)}{\partial s_c^t}\\ &= \frac{\partial \mathcal{L}^t(x,z)}{\partial s_c^t} + \frac{\partial \mathcal{L}^{t+1}(x,z)}{\partial s_c^{t+1}}\frac{\partial s_c^{t+1}}{\partial s_c^t} + \frac{\partial \mathcal{L}^{t+1}(x,z)}{\partial a_\iota^{t+1}}\frac{\partial a_\iota^{t+1}}{\partial s_c^t} + \frac{\partial \mathcal{L}^{t+1}(x,z)}{\partial a_\phi^{t+1}}\frac{\partial a_\phi^{t+1}}{\partial s_c^t}\\ &= (\frac{\partial \mathcal{L}(x,z)}{\partial a_\omega^t}\frac{\partial a_\omega^t}{\partial s_c^t} + \frac{\partial \mathcal{L}(x,z)}{\partial b_c^t}\frac{\partial b_c^t}{\partial s_c^t}) + b_\phi^{t+1}\epsilon_s^{t+1} + \omega_{c\iota}\delta_\iota^{t+1} + \omega_{c\phi}\delta_\phi^{t+1}\\ &= \delta_\omega^t \omega_{c\omega} + \epsilon_c^t b_\omega^t h'(s_c^t) + b_\phi^{t+1}\epsilon_s^{t+1} + \omega_{c\iota}\delta_\iota^{t+1} + \omega_{c\phi}\delta_\phi^{t+1} \end{aligned}$

Therefore:

$\delta_c^t = \epsilon_s^t b_\iota^tg'(a_c^t)$

Cell

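Attentive readers will notice that in the paper $\frac{\partial \mathcal{L}(x,z)}{\partial b_c^t}$ carries no summation; I have reservations about this and think a summation term should be present.

In addition, the Cell state feeds the next time step's Cell state (scaled by the Forget Gate) and, through the cell-to-gate weights, the next time step's Input Gate and Forget Gate, so these can propagate gradients back. This is why $\epsilon_s^t$ in the paper contains the three terms $b_\phi^{t+1} \epsilon_s^{t+1}$ , $\omega_{c\iota}\delta_\iota^{t+1}$ and $\omega_{c\phi}\delta_\phi^{t+1}$ .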

7. Gradient of the Forget Gate

The Forget Gate gradient is comparatively simple:

$\delta_\phi^t = \frac{\partial \mathcal{L}(x,z)}{\partial a_\phi^t} = \frac{\partial \mathcal{L}(x,z)}{\partial s_c^t} \frac{\partial s_c^t}{\partial b_\phi^t} \frac{\partial b_\phi^t}{\partial a_\phi^t}=\epsilon_s^t s_c^{t-1} f'(a_\phi^t)$

Forget Gate

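Also, since a single block may contain several memory cells sharing one Forget Gate, one Input Gate, and one Output Gate, the paper writes the Forget Gate gradient as $f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1} \epsilon_s^t$ ; the derivation is the same, with the gradients accumulated at the single gate.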

8. Gradient of the Input Gate

The Input Gate gradient is computed as follows:

$\delta_\iota^t = \frac{\partial \mathcal{L}(x,z)}{\partial a_\iota^t} = \frac{\partial \mathcal{L}(x,z)}{\partial s_c^t} \frac{\partial s_c^t}{\partial b_\iota^t} \frac{\partial b_\iota^t}{\partial a_\iota^t}=\epsilon_s^t g(a_c^t) f'(a_\iota^t)$

Input Gate

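Also, since a single block may contain several memory cells sharing one Forget Gate, one Input Gate, and one Output Gate, the paper writes the Input Gate gradient as $f'(a_\iota^t) \sum_{c=1}^{C} g(a_c^t)\epsilon_s^t$ ; the derivation is the same, with the gradients accumulated at the single gate.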
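One way to gain confidence in the backward equations is to implement them for the single-cell case and compare a resulting weight gradient with a finite-difference estimate. The sketch below does this under several assumptions that are not part of the derivation: $f$ is the logistic sigmoid, $g$ and $h$ are `tanh`, the initial cell state is 0, the output layer is $a_k^t = \omega_{ck} b_c^t$ , the loss is summed over time steps, and each weight gradient is obtained as the sum over time of the unit's $\delta$ times the value feeding that weight. All parameter names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(xs, p):
    """Forward pass over a sequence; returns one cache of activations per step."""
    s_prev, cache = 0.0, []
    for x in xs:
        b_i = sigmoid(p["w_xi"] @ x + p["w_ci"] * s_prev)
        b_f = sigmoid(p["w_xf"] @ x + p["w_cf"] * s_prev)
        g_c = np.tanh(p["w_xc"] @ x)
        s = b_f * s_prev + b_i * g_c
        b_o = sigmoid(p["w_xo"] @ x + p["w_co"] * s)
        y = softmax(p["w_ck"] * (b_o * np.tanh(s)))
        cache.append(dict(x=x, s_prev=s_prev, s=s, b_i=b_i, b_f=b_f,
                          b_o=b_o, g_c=g_c, y=y))
        s_prev = s
    return cache

def loss(xs, zs, p):
    return -sum(float(np.sum(z * np.log(c["y"])))
                for c, z in zip(forward(xs, p), zs))

def backward(xs, zs, p):
    """Backward pass; returns dL/dw_ci (scalar) and dL/dw_xc (vector)."""
    cache = forward(xs, p)
    grad_w_ci, grad_w_xc = 0.0, np.zeros_like(p["w_xc"])
    eps_s_next = d_i_next = d_f_next = b_f_next = 0.0   # nothing after the last step
    for c, z in zip(cache[::-1], zs[::-1]):
        eps_c = float((c["y"] - z) @ p["w_ck"])                  # epsilon_c
        h_s = np.tanh(c["s"])
        d_o = eps_c * h_s * c["b_o"] * (1 - c["b_o"])            # delta_omega
        eps_s = (d_o * p["w_co"] + eps_c * c["b_o"] * (1 - h_s ** 2)
                 + b_f_next * eps_s_next
                 + p["w_ci"] * d_i_next + p["w_cf"] * d_f_next)  # epsilon_s
        d_c = eps_s * c["b_i"] * (1 - c["g_c"] ** 2)             # delta_c
        d_f = eps_s * c["s_prev"] * c["b_f"] * (1 - c["b_f"])    # delta_phi
        d_i = eps_s * c["g_c"] * c["b_i"] * (1 - c["b_i"])       # delta_iota
        grad_w_ci += d_i * c["s_prev"]
        grad_w_xc += d_c * c["x"]
        eps_s_next, d_i_next, d_f_next, b_f_next = eps_s, d_i, d_f, c["b_f"]
    return grad_w_ci, grad_w_xc

rng = np.random.default_rng(0)
I, K, T = 3, 4, 6
p = {k: rng.normal(scale=0.5, size=I) for k in ("w_xi", "w_xf", "w_xc", "w_xo")}
p.update(w_ci=0.3, w_cf=-0.2, w_co=0.4, w_ck=rng.normal(size=K))
xs = rng.normal(size=(T, I))
zs = np.eye(K)[rng.integers(0, K, size=T)]   # random one-hot targets

analytic_w_ci, analytic_w_xc = backward(xs, zs, p)
h = 1e-6
numeric_w_ci = (loss(xs, zs, {**p, "w_ci": p["w_ci"] + h})
                - loss(xs, zs, {**p, "w_ci": p["w_ci"] - h})) / (2 * h)
```

The weight `w_ci` is a useful one to check because its gradient depends on the $\epsilon_s^t$ recursion through every later time step; if the two values agree closely, the recursion is being applied consistently with the forward pass.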