Recurrent Neural Networks 2: LSTM

Latest recommended article published on 2022-04-29 20:46:53

Rhine_Yu


Category: Deep Learning. Tags: RNN, LSTM, DeepLearning


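This week I have been reading about recurrent neural networks and came across a blog post whose derivations are extremely detailed, so I am recording the key points here.

Detailed derivation

I strongly recommend working through the derivation by hand once. It takes some time, but it makes the reasoning much clearer.

Long Short-Term Memory Networks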

Recall the formula from the BPTT algorithm for propagating the error term backward through time:

$\begin{align} \delta_k^T=&\delta_t^T\prod_{i=k}^{t-1}diag[f'(\mathbf{net}_{i})]W \end{align}$

Using the properties of norms, we obtain an upper bound on the norm of $\delta_{k}^{T}$:

$\begin{align} \|\delta_k^T\|\leqslant&\|\delta_t^T\|\prod_{i=k}^{t-1}\|diag[f'(\mathbf{net}_{i})]\|\|W\|\\ \leqslant&\|\delta_t^T\|(\beta_f\beta_W)^{t-k} \end{align}$

We can see that when the error term $\delta$ is passed from time t back to time k, the upper bound on its value is an exponential function of $\beta_{f}\beta_{W}$, where $\beta_{f}$ and $\beta_{W}$ are upper bounds on the norms of the diagonal matrix $diag[f'(\mathbf{net}_{i})]$ and the matrix W, respectively. Clearly, when t-k is large (the error is propagated across many time steps), the value can become extremely large if $\beta_{f}\beta_{W}>1$, which is the exploding gradient problem, or extremely small if $\beta_{f}\beta_{W}<1$, which is the vanishing gradient problem.
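To get a feel for how quickly the bound $\|\delta_t^T\|(\beta_f\beta_W)^{t-k}$ grows or shrinks, here is a minimal sketch. The function name `bptt_bound` and the sample values of the two bounds are assumptions chosen for illustration only; they are not taken from the referenced derivation.

```python
def bptt_bound(beta_f, beta_w, steps, delta_t_norm=1.0):
    """Upper bound ||delta_t|| * (beta_f * beta_w) ** steps, with steps = t - k."""
    return delta_t_norm * (beta_f * beta_w) ** steps


# Two illustrative cases: product below 1 (shrinks) and above 1 (grows).
for steps in (1, 10, 50):
    shrinking = bptt_bound(0.25, 2.0, steps)  # beta_f * beta_w = 0.5
    growing = bptt_bound(1.0, 1.5, steps)     # beta_f * beta_w = 1.5
    print(f"t-k={steps:2d}  product 0.5: {shrinking:.3e}  product 1.5: {growing:.3e}")
```

The value 0.25 is the maximum of the sigmoid derivative, so the first case roughly corresponds to a sigmoid activation with a weight norm bounded by 2.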

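To address the exploding and vanishing gradient problems of RNNs, the Long Short-Term Memory network (LSTM) was introduced. The hidden layer of the original RNN has only one state, h, which is very sensitive to short-term inputs. If we add a second state, c, and let it store the long-term state, the original RNN's inability to handle long-range dependencies can be addressed.

The newly added state c is called the cell state.

[Figure: the LSTM unrolled along the time dimension]

At time t, the LSTM has three inputs: the current network input $x_{t}$, the previous LSTM output $h_{t-1}$, and the previous cell state $c_{t-1}$. It has two outputs: the current LSTM output $h_{t}$ and the current cell state $c_{t}$. Here $x, h, c$ are all vectors.

The key to the LSTM is how the long-term state c is controlled. The idea is to use three control switches:

The first switch controls whether the long-term state c continues to be kept (forget gate);

The second switch controls whether the immediate state is written into the long-term state c (input gate);

The third switch controls whether the long-term state c is used as the current LSTM output (output gate).

Next, we describe in detail how the output h and the cell state c are computed.

Forward Computation of the LSTM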

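Each switch is implemented as a gate. A gate is effectively a fully connected layer: its input is a vector and its output is a vector of real numbers between 0 and 1. Let W be the gate's weight matrix and b its bias term; the gate can then be written as:

$g(\mathbf{x})=\sigma(W\mathbf{x}+\mathbf{b})$

A gate is used by multiplying its output vector element-wise with the vector we want to control. When the gate outputs 0, any vector multiplied by it becomes the zero vector, so nothing passes through; when it outputs 1, any vector multiplied by it is unchanged, so everything passes through. In the formula above, $\sigma$ is the sigmoid function, whose range is (0,1), so a gate is always partially open, never fully open or fully closed.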
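The gate formula can be sketched in a few lines of NumPy. The names `sigmoid` and `gate`, the shapes, and the random values below are assumptions made for this illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function; every output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def gate(W, b, x):
    """g(x) = sigmoid(W x + b): a fully connected layer with sigmoid output."""
    return sigmoid(W @ x + b)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3 gate units, 4 inputs
b = np.zeros(3)
x = rng.normal(size=4)

g = gate(W, b, x)                      # one openness value per unit
state = np.array([10.0, -5.0, 2.0])    # the vector being controlled
controlled = g * state                 # element-wise scaling by the gate
print(g, controlled)
```

Each component of `controlled` is the matching component of `state` scaled by a factor strictly between 0 and 1.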

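The LSTM uses two gates to control the contents of the cell state c. The forget gate determines how much of the previous cell state $\mathbf{c}_{t-1}$ is kept in the current cell state $\mathbf{c}_{t}$; the input gate determines how much of the current network input $\mathbf{x}_{t}$ is saved into the cell state $\mathbf{c}_{t}$. The LSTM uses the output gate to control how much of the cell state $\mathbf{c}_{t}$ is emitted as the current LSTM output $\mathbf{h}_{t}$.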

1. Forget gate:

$\mathbf{f}_t=\sigma(W_f\cdot[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_f)\qquad\quad(\text{Eq. 1})$

Here $\mathbf{W}_{f}$ is the weight matrix of the forget gate, $[\mathbf{h}_{t-1},\mathbf{x}_{t}]$ denotes the concatenation of the two vectors into one longer vector, $\mathbf{b}_{f}$ is the bias term of the forget gate, and $\sigma$ is the sigmoid function. If the input has dimension $d_{x}$, the hidden layer has dimension $d_{h}$, and the cell state has dimension $d_{c}$ (usually $d_{c}=d_{h}$), then the forget gate's weight matrix $\mathbf{W}_{f}$ has dimension $d_{c}\times(d_{h}+d_{x})$.

In fact, the weight matrix $\mathbf{W}_{f}$ is the concatenation of two matrices: $\mathbf{W}_{fh}$, which corresponds to the input $\mathbf{h}_{t-1}$ and has dimension $d_{c}\times d_{h}$, and $\mathbf{W}_{fx}$, which corresponds to the input $\mathbf{x}_{t}$ and has dimension $d_{c}\times d_{x}$. The product with $\mathbf{W}_{f}$ can be written as:

$\begin{align} \begin{bmatrix}W_f\end{bmatrix}\begin{bmatrix}\mathbf{h}_{t-1}\\ \mathbf{x}_t\end{bmatrix}&= \begin{bmatrix}W_{fh}&W_{fx}\end{bmatrix}\begin{bmatrix}\mathbf{h}_{t-1}\\ \mathbf{x}_t\end{bmatrix}\\ &=W_{fh}\mathbf{h}_{t-1}+W_{fx}\mathbf{x}_t \end{align}$

[Figure: computation of the forget gate]
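The split of $\mathbf{W}_{f}$ into $\mathbf{W}_{fh}$ and $\mathbf{W}_{fx}$ can be checked numerically. The sketch below assumes small illustrative dimensions and random matrices; none of the concrete values come from the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_c = 3, 4, 4   # assumed sizes, with d_c = d_h as is usual

W_fh = rng.normal(size=(d_c, d_h))   # acts on h_{t-1}
W_fx = rng.normal(size=(d_c, d_x))   # acts on x_t
W_f = np.hstack([W_fh, W_fx])        # shape (d_c, d_h + d_x)

h_prev = rng.normal(size=d_h)
x_t = rng.normal(size=d_x)

# Product with the concatenated vector vs. the sum of the two partial products.
joint = W_f @ np.concatenate([h_prev, x_t])
split = W_fh @ h_prev + W_fx @ x_t
print(np.allclose(joint, split))
```

The two results agree up to floating-point rounding, which is why the backward pass can treat the two blocks of each weight matrix separately.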

2. Input gate:

$\mathbf{i}_t=\sigma(W_i\cdot[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_i)\qquad\quad(\text{Eq. 2})$

Here $\mathbf{W}_{i}$ is the weight matrix of the input gate and $\mathbf{b}_{i}$ is its bias term.

[Figure: computation of the input gate]

Next, we compute the cell state $\mathbf{\tilde{c}}_{t}$ that describes the current input. It is computed from the previous output and the current input:

$\mathbf{\tilde{c}}_t=\tanh(W_c\cdot[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_c)\qquad\quad(\text{Eq. 3})$

[Figure: computation of $\mathbf{\tilde{c}}_{t}$]

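Now we compute the cell state $\mathbf{c}_{t}$ at the current time. It is obtained by multiplying the previous cell state $\mathbf{c}_{t-1}$ element-wise by the forget gate $\mathbf{f}_{t}$, multiplying the current-input cell state $\mathbf{\tilde{c}}_{t}$ element-wise by the input gate $\mathbf{i}_{t}$, and adding the two products:

$\mathbf{c}_t=\mathbf{f}_t\circ{\mathbf{c}_{t-1}}+\mathbf{i}_t\circ{\mathbf{\tilde{c}}_t}\qquad\quad(\text{Eq. 4})$

The symbol $\circ$ denotes element-wise multiplication.

[Figure: computation of $\mathbf{c}_{t}$]

In this way, the LSTM combines its current memory $\mathbf{\tilde{c}}_{t}$ with its long-term memory $\mathbf{c}_{t-1}$ to form the new cell state $\mathbf{c}_{t}$. Thanks to the forget gate, it can retain information from long ago; thanks to the input gate, it can keep unimportant current content out of the memory.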

3. Output gate:

$\mathbf{o}_t=\sigma(W_o\cdot[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_o)\qquad\quad(\text{Eq. 5})$

[Figure: computation of the output gate]

The final output of the LSTM is determined jointly by the output gate and the cell state:

$\mathbf{h}_{t}=\mathbf{o}_{t}\circ\tanh(\mathbf{c}_{t})\qquad\quad(\text{Eq. 6})$

[Figure: computation of the final LSTM output]

Equations 1 through 6 are all the formulas of the LSTM forward computation.
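Equations 1 through 6 translate almost line by line into NumPy. This is a minimal sketch under assumptions: the function names `lstm_step` and `init_params`, the parameter dictionary layout, the small random initialization, and the choice $d_c=d_h$ are all illustrative and not part of the referenced derivation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step following Eq. 1 to Eq. 6."""
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ hx + params["b_f"])        # Eq. 1, forget gate
    i_t = sigmoid(params["W_i"] @ hx + params["b_i"])        # Eq. 2, input gate
    c_tilde = np.tanh(params["W_c"] @ hx + params["b_c"])    # Eq. 3
    c_t = f_t * c_prev + i_t * c_tilde                       # Eq. 4
    o_t = sigmoid(params["W_o"] @ hx + params["b_o"])        # Eq. 5, output gate
    h_t = o_t * np.tanh(c_t)                                 # Eq. 6
    return h_t, c_t


def init_params(d_x, d_h, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
        params[f"b_{name}"] = np.zeros(d_h)
    return params


d_x, d_h = 3, 4
params = init_params(d_x, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
xs = np.random.default_rng(1).normal(size=(5, d_x))   # a toy sequence of 5 inputs
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Note that only `h` and `c` are carried from one step to the next, which matches the three-input, two-output description of the unrolled LSTM.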

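Training the LSTM

Training is more involved than the forward computation. The derivation follows.

LSTM Training Algorithm Framework

The LSTM is still trained with backpropagation, which consists of three main steps:

Forward pass: compute the output value of every neuron. For the LSTM, these are the five vectors $\mathbf{f}_{t}, \mathbf{i}_{t}, \mathbf{c}_{t}, \mathbf{o}_{t}, \mathbf{h}_{t}$;
Backward pass: compute the error term $\delta$ of every neuron. As with the RNN, the LSTM error term propagates in two directions: backward through time, computing the error term at each earlier time step starting from the current time t, and down to the previous layer;
Compute the gradient of each weight from the corresponding error terms.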

Notes on Formulas and Notation

In the derivation below, the gates use the sigmoid activation function and the output uses tanh. These functions and their derivatives are:

$\begin{align} \sigma(z)&=y=\frac{1}{1+e^{-z}}\\ \sigma'(z)&=y(1-y)\\ \tanh(z)&=y=\frac{e^z-e^{-z}}{e^z+e^{-z}}\\ \tanh'(z)&=1-y^2 \end{align}$

As the formulas show, the derivatives of sigmoid and tanh are both functions of the original function's value, so once the function value has been computed the derivative follows directly.
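The two derivative identities can be verified against central finite differences. The helper names and the sample points below are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad_from_output(y):
    """sigma'(z) written in terms of y = sigma(z)."""
    return y * (1.0 - y)


def tanh_grad_from_output(y):
    """tanh'(z) written in terms of y = tanh(z)."""
    return 1.0 - y ** 2


z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6

# Central differences as an independent estimate of the derivatives.
num_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print(np.max(np.abs(sigmoid_grad_from_output(sigmoid(z)) - num_sig)))
print(np.max(np.abs(tanh_grad_from_output(np.tanh(z)) - num_tanh)))
```

In a backward pass this means the activations cached during the forward pass are enough; the weighted inputs do not need to be stored separately.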

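The LSTM has 8 groups of parameters to learn. Because the two parts of each weight matrix use different formulas during backpropagation, each weight matrix is split in two below:

Forget gate: weight matrix $\mathbf{W}_{f}$ and bias term $\mathbf{b}_{f}$; $\mathbf{W}_{f}$ is split into the two matrices $\mathbf{W}_{fh}$ and $\mathbf{W}_{fx}$
Input gate: weight matrix $\mathbf{W}_{i}$ and bias term $\mathbf{b}_{i}$; $\mathbf{W}_{i}$ is split into the two matrices $\mathbf{W}_{ih}$ and $\mathbf{W}_{ix}$
Output gate: weight matrix $\mathbf{W}_{o}$ and bias term $\mathbf{b}_{o}$; $\mathbf{W}_{o}$ is split into the two matrices $\mathbf{W}_{oh}$ and $\mathbf{W}_{ox}$
Cell state computation: weight matrix $\mathbf{W}_{c}$ and bias term $\mathbf{b}_{c}$; $\mathbf{W}_{c}$ is split into the two matrices $\mathbf{W}_{ch}$ and $\mathbf{W}_{cx}$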

The element-wise product symbol $\circ$. When $\circ$ acts on two vectors, the operation is:

$\mathbf{a}\circ\mathbf{b}=\begin{bmatrix} a_1\\a_2\\a_3\\...\\a_n \end{bmatrix}\circ\begin{bmatrix} b_1\\b_2\\b_3\\...\\b_n \end{bmatrix}=\begin{bmatrix} a_1b_1\\a_2b_2\\a_3b_3\\...\\a_nb_n \end{bmatrix}$

When $\circ$ acts on a vector and a matrix, the operation is:

$\begin{align} \mathbf{a}\circ X&=\begin{bmatrix} a_1\\a_2\\a_3\\...\\a_n \end{bmatrix}\circ\begin{bmatrix} x_{11} & x_{12} & x_{13} & ... & x_{1n}\\ x_{21} & x_{22} & x_{23} & ... & x_{2n}\\ x_{31} & x_{32} & x_{33} & ... & x_{3n}\\ & & ...\\ x_{n1} & x_{n2} & x_{n3} & ... & x_{nn}\\ \end{bmatrix}\\ &=\begin{bmatrix} a_1x_{11} & a_1x_{12} & a_1x_{13} & ... & a_1x_{1n}\\ a_2x_{21} & a_2x_{22} & a_2x_{23} & ... & a_2x_{2n}\\ a_3x_{31} & a_3x_{32} & a_3x_{33} & ... & a_3x_{3n}\\ & & ...\\ a_nx_{n1} & a_nx_{n2} & a_nx_{n3} & ... & a_nx_{nn}\\ \end{bmatrix} \end{align}$

When $\circ$ acts on two matrices, the elements at corresponding positions are multiplied. In some cases element-wise multiplication simplifies matrix and vector operations.

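For example, multiplying a matrix on the left by a diagonal matrix is equivalent to multiplying that matrix element-wise by the vector formed from the diagonal:

$diag[\mathbf{a}]X=\mathbf{a}\circ X$

Multiplying a row vector by a diagonal matrix is equivalent to multiplying that row vector element-wise by the vector formed from the diagonal, with the result read as a row vector:

$\mathbf{a}^{T}diag[\mathbf{b}]=\mathbf{a}\circ \mathbf{b}$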
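Both identities are easy to confirm with NumPy broadcasting. The concrete vectors and matrix below are arbitrary values assumed for the illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 4.0])
X = np.arange(9.0).reshape(3, 3)

# diag[a] X scales row k of X by a_k, which is the vector-matrix form of the element-wise product.
left = np.diag(a) @ X
elementwise = a[:, None] * X

# a^T diag[b] scales component k of a by b_k.
row = a @ np.diag(b)

print(np.allclose(left, elementwise), np.allclose(row, a * b))
```

Replacing the diagonal-matrix products with element-wise products is what lets the error-term formulas later in the derivation avoid explicit Jacobians.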
At time t, the output of the LSTM is $\mathbf{h}_{t}$. We define the error term $\delta_{t}$ at time t as:

$\delta_{t} \overset{def}{=}\frac{\partial E}{\partial \mathbf{h}_{t}}$

Note that here the error term is the derivative of the loss function with respect to the output value, not with respect to the weighted input $net_{t}^{l}$. The LSTM has four weighted inputs, corresponding to $\mathbf{f}_{t}, \mathbf{i}_{t}, \mathbf{\tilde{c}}_{t}, \mathbf{o}_{t}$, and we want to pass a single error term to the previous layer rather than four. We still need to define these four weighted inputs and their corresponding error terms:

$\begin{align} \mathbf{net}_{f,t}&=W_f[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_f\\ &=W_{fh}\mathbf{h}_{t-1}+W_{fx}\mathbf{x}_t+\mathbf{b}_f\\ \mathbf{net}_{i,t}&=W_i[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_i\\ &=W_{ih}\mathbf{h}_{t-1}+W_{ix}\mathbf{x}_t+\mathbf{b}_i\\ \mathbf{net}_{\tilde{c},t}&=W_c[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_c\\ &=W_{ch}\mathbf{h}_{t-1}+W_{cx}\mathbf{x}_t+\mathbf{b}_c\\ \mathbf{net}_{o,t}&=W_o[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_o\\ &=W_{oh}\mathbf{h}_{t-1}+W_{ox}\mathbf{x}_t+\mathbf{b}_o\\ \delta_{f,t}&\overset{def}{=}\frac{\partial{E}}{\partial{\mathbf{net}_{f,t}}}\\ \delta_{i,t}&\overset{def}{=}\frac{\partial{E}}{\partial{\mathbf{net}_{i,t}}}\\ \delta_{\tilde{c},t}&\overset{def}{=}\frac{\partial{E}}{\partial{\mathbf{net}_{\tilde{c},t}}}\\ \delta_{o,t}&\overset{def}{=}\frac{\partial{E}}{\partial{\mathbf{net}_{o,t}}}\\ \end{align}$

Backpropagating the Error Term Through Time

Propagating the error term backward through time means computing the error term $\delta_{t-1}$ at time t-1.

$\begin{align} \delta_{t-1}^T&=\frac{\partial{E}}{\partial{\mathbf{h_{t-1}}}}\\ &=\frac{\partial{E}}{\partial{\mathbf{h_t}}}\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{h_{t-1}}}}\\ &=\delta_{t}^T\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{h_{t-1}}}} \end{align}$

Here $\frac{\partial \mathbf{h_t}}{\partial \mathbf{h_{t-1}}}$ is a Jacobian matrix. To find it, we write out the formulas for $\mathbf{h_t}$, namely Eq. 6 and Eq. 4 above:

$\begin{align} \mathbf{h}_{t}&=\mathbf{o}_{t}\circ\tanh(\mathbf{c}_{t})\qquad\quad(\text{Eq. 6}) \\ \mathbf{c}_t&=\mathbf{f}_t\circ{\mathbf{c}_{t-1}}+\mathbf{i}_t\circ{\mathbf{\tilde{c}}_t}\qquad\quad(\text{Eq. 4}) \end{align}$

Clearly $\mathbf{o_t}, \mathbf{f_t}, \mathbf{i_t}, \mathbf{\tilde{c}_t}$ are all functions of $\mathbf{h_{t-1}}$, so by the total derivative formula:

$\begin{align} \delta_t^T\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{h_{t-1}}}}&=\delta_t^T\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{o}_t}}\frac{\partial{\mathbf{o}_t}}{\partial{\mathbf{net}_{o,t}}}\frac{\partial{\mathbf{net}_{o,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_t^T\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{c}_t}}\frac{\partial{\mathbf{c}_t}}{\partial{\mathbf{f_{t}}}}\frac{\partial{\mathbf{f}_t}}{\partial{\mathbf{net}_{f,t}}}\frac{\partial{\mathbf{net}_{f,t}}}{\partial{\mathbf{h_{t-1}}}}\\ &+\delta_t^T\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{c}_t}}\frac{\partial{\mathbf{c}_t}}{\partial{\mathbf{i_{t}}}}\frac{\partial{\mathbf{i}_t}}{\partial{\mathbf{net}_{i,t}}}\frac{\partial{\mathbf{net}_{i,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_t^T\frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{c}_t}}\frac{\partial{\mathbf{c}_t}}{\partial{\mathbf{\tilde{c}}_{t}}}\frac{\partial{\mathbf{\tilde{c}}_t}}{\partial{\mathbf{net}_{\tilde{c},t}}}\frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{\mathbf{h_{t-1}}}}\\ &=\delta_{o,t}^T\frac{\partial{\mathbf{net}_{o,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_{f,t}^T\frac{\partial{\mathbf{net}_{f,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_{i,t}^T\frac{\partial{\mathbf{net}_{i,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_{\tilde{c},t}^T\frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{\mathbf{h_{t-1}}}}\qquad\quad(\text{Eq. 7}) \end{align}$
Next we need every partial derivative in Eq. 7. From Eq. 6 we obtain:

$\begin{align} \frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{o}_t}}&=diag[\tanh(\mathbf{c}_t)]\\ \frac{\partial{\mathbf{h_t}}}{\partial{\mathbf{c}_t}}&=diag[\mathbf{o}_t\circ(1-\tanh(\mathbf{c}_t)^2)] \end{align}$
From Eq. 4 we obtain:

$\begin{align} \frac{\partial{\mathbf{c}_t}}{\partial{\mathbf{f_{t}}}}&=diag[\mathbf{c}_{t-1}]\\ \frac{\partial{\mathbf{c}_t}}{\partial{\mathbf{i_{t}}}}&=diag[\mathbf{\tilde{c}}_t]\\ \frac{\partial{\mathbf{c}_t}}{\partial{\mathbf{\tilde{c}_{t}}}}&=diag[\mathbf{i}_t]\\ \end{align}$
Because:

$\begin{align} \mathbf{o}_t&=\sigma(\mathbf{net}_{o,t})\\ \mathbf{net}_{o,t}&=W_{oh}\mathbf{h}_{t-1}+W_{ox}\mathbf{x}_t+\mathbf{b}_o\\\\ \mathbf{f}_t&=\sigma(\mathbf{net}_{f,t})\\ \mathbf{net}_{f,t}&=W_{fh}\mathbf{h}_{t-1}+W_{fx}\mathbf{x}_t+\mathbf{b}_f\\\\ \mathbf{i}_t&=\sigma(\mathbf{net}_{i,t})\\ \mathbf{net}_{i,t}&=W_{ih}\mathbf{h}_{t-1}+W_{ix}\mathbf{x}_t+\mathbf{b}_i\\\\ \mathbf{\tilde{c}}_t&=\tanh(\mathbf{net}_{\tilde{c},t})\\ \mathbf{net}_{\tilde{c},t}&=W_{ch}\mathbf{h}_{t-1}+W_{cx}\mathbf{x}_t+\mathbf{b}_c\\ \end{align}$
we obtain:

$\begin{align} \frac{\partial{\mathbf{o}_t}}{\partial{\mathbf{net}_{o,t}}}&=diag[\mathbf{o}_t\circ(1-\mathbf{o}_t)]\\ \frac{\partial{\mathbf{net}_{o,t}}}{\partial{\mathbf{h_{t-1}}}}&=W_{oh}\\ \frac{\partial{\mathbf{f}_t}}{\partial{\mathbf{net}_{f,t}}}&=diag[\mathbf{f}_t\circ(1-\mathbf{f}_t)]\\ \frac{\partial{\mathbf{net}_{f,t}}}{\partial{\mathbf{h}_{t-1}}}&=W_{fh}\\ \frac{\partial{\mathbf{i}_t}}{\partial{\mathbf{net}_{i,t}}}&=diag[\mathbf{i}_t\circ(1-\mathbf{i}_t)]\\ \frac{\partial{\mathbf{net}_{i,t}}}{\partial{\mathbf{h}_{t-1}}}&=W_{ih}\\ \frac{\partial{\mathbf{\tilde{c}}_t}}{\partial{\mathbf{net}_{\tilde{c},t}}}&=diag[1-\mathbf{\tilde{c}}_t^2]\\ \frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{\mathbf{h}_{t-1}}}&=W_{ch} \end{align}$
Substituting these partial derivatives into Eq. 7 gives:

$\begin{align} \delta_{t-1}^T&=\delta_{o,t}^T\frac{\partial{\mathbf{net}_{o,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_{f,t}^T\frac{\partial{\mathbf{net}_{f,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_{i,t}^T\frac{\partial{\mathbf{net}_{i,t}}}{\partial{\mathbf{h_{t-1}}}} +\delta_{\tilde{c},t}^T\frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{\mathbf{h_{t-1}}}}\\ &=\delta_{o,t}^T W_{oh} +\delta_{f,t}^TW_{fh} +\delta_{i,t}^TW_{ih} +\delta_{\tilde{c},t}^TW_{ch}\qquad\quad(\text{Eq. 8})\\ \end{align}$

From the definitions of $\delta_{o,t}, \delta_{f,t}, \delta_{i,t}, \delta_{\tilde{c},t}$, we have:

$\begin{align} \delta_{o,t}^T&=\delta_t^T\circ\tanh(\mathbf{c}_t)\circ\mathbf{o}_t\circ(1-\mathbf{o}_t)\qquad\quad(\text{Eq. 9})\\ \delta_{f,t}^T&=\delta_t^T\circ\mathbf{o}_t\circ(1-\tanh(\mathbf{c}_t)^2)\circ\mathbf{c}_{t-1}\circ\mathbf{f}_t\circ(1-\mathbf{f}_t)\qquad(\text{Eq. 10})\\ \delta_{i,t}^T&=\delta_t^T\circ\mathbf{o}_t\circ(1-\tanh(\mathbf{c}_t)^2)\circ\mathbf{\tilde{c}}_t\circ\mathbf{i}_t\circ(1-\mathbf{i}_t)\qquad\quad(\text{Eq. 11})\\ \delta_{\tilde{c},t}^T&=\delta_t^T\circ\mathbf{o}_t\circ(1-\tanh(\mathbf{c}_t)^2)\circ\mathbf{i}_t\circ(1-\mathbf{\tilde{c}}_t^2)\qquad\quad(\text{Eq. 12})\\ \end{align}$

Equations 8 through 12 are the formulas for propagating the error backward through time by one step. With them, we can write the formula for passing the error term back to any time k, which is shorthand for applying Eq. 8 repeatedly, one step at a time:

$\delta_k^T=\prod_{j=k}^{t-1}\delta_{o,j}^TW_{oh} +\delta_{f,j}^TW_{fh} +\delta_{i,j}^TW_{ih} +\delta_{\tilde{c},j}^TW_{ch}\qquad\quad(\text{Eq. 13})$
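The one-step formulas (Eq. 8 to Eq. 12) can be sanity-checked against finite differences. The sketch below makes several assumptions for the sake of a small test: a toy loss $E=\mathbf{w}^T\mathbf{h}_t$ (so $\delta_t=\mathbf{w}$), random weights, $\mathbf{c}_{t-1}$ held fixed while $\mathbf{h}_{t-1}$ is perturbed, and hypothetical function names `forward` and `backward_one_step`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(h_prev, x_t, c_prev, W, b):
    """Eq. 1 to Eq. 6; W[g] = (W_gh, W_gx) and b[g] for g in 'f', 'i', 'c', 'o'."""
    net = {g: W[g][0] @ h_prev + W[g][1] @ x_t + b[g] for g in "fico"}
    f, i, o = sigmoid(net["f"]), sigmoid(net["i"]), sigmoid(net["o"])
    c_tilde = np.tanh(net["c"])
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, dict(f=f, i=i, o=o, c_tilde=c_tilde, c=c)


def backward_one_step(delta_t, cache, c_prev, W):
    """Eq. 8 to Eq. 12: returns delta_{t-1} and the four gate error terms."""
    f, i, o, c_tilde, c = (cache[k] for k in ("f", "i", "o", "c_tilde", "c"))
    d_c = delta_t * o * (1.0 - np.tanh(c) ** 2)      # shared factor via c_t
    d = {
        "o": delta_t * np.tanh(c) * o * (1.0 - o),   # Eq. 9
        "f": d_c * c_prev * f * (1.0 - f),           # Eq. 10
        "i": d_c * c_tilde * i * (1.0 - i),          # Eq. 11
        "c": d_c * i * (1.0 - c_tilde ** 2),         # Eq. 12
    }
    delta_prev = sum(d[g] @ W[g][0] for g in "fico")  # Eq. 8
    return delta_prev, d


rng = np.random.default_rng(0)
d_x, d_h = 3, 4
W = {g: (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))) for g in "fico"}
b = {g: rng.normal(size=d_h) for g in "fico"}
h_prev, x_t, c_prev = rng.normal(size=d_h), rng.normal(size=d_x), rng.normal(size=d_h)
w_out = rng.normal(size=d_h)   # toy loss E = w_out . h_t, hence delta_t = w_out

h_t, cache = forward(h_prev, x_t, c_prev, W, b)
delta_prev, gate_deltas = backward_one_step(w_out, cache, c_prev, W)

# Central finite differences of E with respect to each component of h_{t-1}.
eps = 1e-6
numeric = np.zeros(d_h)
for k in range(d_h):
    step = np.zeros(d_h)
    step[k] = eps
    e_plus = w_out @ forward(h_prev + step, x_t, c_prev, W, b)[0]
    e_minus = w_out @ forward(h_prev - step, x_t, c_prev, W, b)[0]
    numeric[k] = (e_plus - e_minus) / (2 * eps)

print(np.max(np.abs(delta_prev - numeric)))
```

The check covers a single step only. Over several steps, Eq. 8 would be applied once per time step, feeding each `delta_prev` back in as the next `delta_t`.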

Passing the Error Term to the Previous Layer

Suppose the current layer is layer $l$. We define the error term of layer $l-1$ as the derivative of the error function with respect to the weighted input of layer $l-1$:

$\delta_t^{l-1}\overset{def}{=}\frac{\partial{E}}{\partial{\mathbf{net}_t^{l-1}}}$

The input $\mathbf{x}_{t}$ of the LSTM at this step is computed as:

$\mathbf{x}_{t}^{l}=\mathbf{f}^{l-1}(\mathbf{net}_{t}^{l-1})$

Here $\mathbf{f}^{l-1}$ denotes the activation function of layer $l-1$.

Because $\mathbf{net}_{f,t}^{l}, \mathbf{net}_{i,t}^{l}, \mathbf{net}_{\tilde{c},t}^{l}, \mathbf{net}_{o,t}^{l}$ are all functions of $\mathbf{x_t}$, and $\mathbf{x_t}$ is in turn a function of $\mathbf{net}_{t}^{l-1}$, computing the derivative of $E$ with respect to $\mathbf{net}_{t}^{l-1}$ requires the total derivative formula:

$\begin{align} \frac{\partial{E}}{\partial{\mathbf{net}_t^{l-1}}}&=\frac{\partial{E}}{\partial{\mathbf{\mathbf{net}_{f,t}^l}}}\frac{\partial{\mathbf{\mathbf{net}_{f,t}^l}}}{\partial{\mathbf{x}_t^l}}\frac{\partial{\mathbf{x}_t^l}}{\partial{\mathbf{\mathbf{net}_t^{l-1}}}} +\frac{\partial{E}}{\partial{\mathbf{\mathbf{net}_{i,t}^l}}}\frac{\partial{\mathbf{\mathbf{net}_{i,t}^l}}}{\partial{\mathbf{x}_t^l}}\frac{\partial{\mathbf{x}_t^l}}{\partial{\mathbf{\mathbf{net}_t^{l-1}}}}\\ &+\frac{\partial{E}}{\partial{\mathbf{\mathbf{net}_{\tilde{c},t}^l}}}\frac{\partial{\mathbf{\mathbf{net}_{\tilde{c},t}^l}}}{\partial{\mathbf{x}_t^l}}\frac{\partial{\mathbf{x}_t^l}}{\partial{\mathbf{\mathbf{net}_t^{l-1}}}} +\frac{\partial{E}}{\partial{\mathbf{\mathbf{net}_{o,t}^l}}}\frac{\partial{\mathbf{\mathbf{net}_{o,t}^l}}}{\partial{\mathbf{x}_t^l}}\frac{\partial{\mathbf{x}_t^l}}{\partial{\mathbf{\mathbf{net}_t^{l-1}}}}\\ &=\delta_{f,t}^TW_{fx}\circ f'(\mathbf{net}_t^{l-1})+\delta_{i,t}^TW_{ix}\circ f'(\mathbf{net}_t^{l-1})+\delta_{\tilde{c},t}^TW_{cx}\circ f'(\mathbf{net}_t^{l-1})+\delta_{o,t}^TW_{ox}\circ f'(\mathbf{net}_t^{l-1})\\ &=(\delta_{f,t}^TW_{fx}+\delta_{i,t}^TW_{ix}+\delta_{\tilde{c},t}^TW_{cx}+\delta_{o,t}^TW_{ox})\circ f'(\mathbf{net}_t^{l-1})\qquad\quad(\text{Eq. 14}) \end{align}$

Eq. 14 is the formula for passing the error to the previous layer.
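Eq. 14 amounts to four vector-matrix products followed by one element-wise scaling. The sketch below assumes that layer $l-1$ uses a tanh activation (the derivation leaves $\mathbf{f}^{l-1}$ generic), and the function name `error_to_previous_layer` and the random inputs are illustrative only.

```python
import numpy as np


def error_to_previous_layer(gate_deltas, W_x, net_prev):
    """Eq. 14, assuming f^{l-1} = tanh so that f'(net) = 1 - tanh(net)^2."""
    # Sum of delta_{g,t}^T W_{gx} over the four gates: a vector of length d_x.
    back = sum(gate_deltas[g] @ W_x[g] for g in "fico")
    return back * (1.0 - np.tanh(net_prev) ** 2)


rng = np.random.default_rng(0)
d_x, d_h = 3, 4
gate_deltas = {g: rng.normal(size=d_h) for g in "fico"}    # delta_{f,t}, delta_{i,t}, ...
W_x = {g: rng.normal(size=(d_h, d_x)) for g in "fico"}     # W_fx, W_ix, W_cx, W_ox
net_prev = rng.normal(size=d_x)                            # net_t^{l-1}

delta_lower = error_to_previous_layer(gate_deltas, W_x, net_prev)
print(delta_lower)
```

The result has one entry per input dimension, which is the shape the layer below expects for its own error term.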

Computing the Weight Gradients

For $\mathbf{W}_{fh}, \mathbf{W}_{ih}, \mathbf{W}_{ch}, \mathbf{W}_{oh}$, the gradient is the sum of the gradients at each time step. We first compute their gradients at time t and then their final gradients.

We have already obtained the error terms $\delta_{o,t}, \delta_{f,t}, \delta_{i,t}, \delta_{\tilde{c},t}$, so the gradients of $\mathbf{W}_{oh}, \mathbf{W}_{fh}, \mathbf{W}_{ih}, \mathbf{W}_{ch}$ at time t follow easily:

$\begin{align} \frac{\partial{E}}{\partial{W_{oh,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{o,t}}}\frac{\partial{\mathbf{net}_{o,t}}}{\partial{W_{oh,t}}}\\ &=\delta_{o,t}\mathbf{h}_{t-1}^T\\\\ \frac{\partial{E}}{\partial{W_{fh,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{f,t}}}\frac{\partial{\mathbf{net}_{f,t}}}{\partial{W_{fh,t}}}\\ &=\delta_{f,t}\mathbf{h}_{t-1}^T\\\\ \frac{\partial{E}}{\partial{W_{ih,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{i,t}}}\frac{\partial{\mathbf{net}_{i,t}}}{\partial{W_{ih,t}}}\\ &=\delta_{i,t}\mathbf{h}_{t-1}^T\\\\ \frac{\partial{E}}{\partial{W_{ch,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{\tilde{c},t}}}\frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{W_{ch,t}}}\\ &=\delta_{\tilde{c},t}\mathbf{h}_{t-1}^T\\ \end{align}$

Adding up the gradients at every time step gives the final gradient:

$\begin{align} \frac{\partial{E}}{\partial{W_{oh}}}&=\sum_{j=1}^t\delta_{o,j}\mathbf{h}_{j-1}^T\\ \frac{\partial{E}}{\partial{W_{fh}}}&=\sum_{j=1}^t\delta_{f,j}\mathbf{h}_{j-1}^T\\ \frac{\partial{E}}{\partial{W_{ih}}}&=\sum_{j=1}^t\delta_{i,j}\mathbf{h}_{j-1}^T\\ \frac{\partial{E}}{\partial{W_{ch}}}&=\sum_{j=1}^t\delta_{\tilde{c},j}\mathbf{h}_{j-1}^T\\ \end{align}$
For the gradients of the bias terms $\mathbf{b_f}, \mathbf{b_i}, \mathbf{b_c}, \mathbf{b_o}$, we first compute the bias gradient at each time step:

$\begin{align} \frac{\partial{E}}{\partial{\mathbf{b}_{o,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{o,t}}}\frac{\partial{\mathbf{net}_{o,t}}}{\partial{\mathbf{b}_{o,t}}}\\ &=\delta_{o,t}\\\\ \frac{\partial{E}}{\partial{\mathbf{b}_{f,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{f,t}}}\frac{\partial{\mathbf{net}_{f,t}}}{\partial{\mathbf{b}_{f,t}}}\\ &=\delta_{f,t}\\\\ \frac{\partial{E}}{\partial{\mathbf{b}_{i,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{i,t}}}\frac{\partial{\mathbf{net}_{i,t}}}{\partial{\mathbf{b}_{i,t}}}\\ &=\delta_{i,t}\\\\ \frac{\partial{E}}{\partial{\mathbf{b}_{c,t}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{\tilde{c},t}}}\frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{\mathbf{b}_{c,t}}}\\ &=\delta_{\tilde{c},t}\\ \end{align}$
Adding up the bias gradients at every time step:

$\begin{align} \frac{\partial{E}}{\partial{\mathbf{b}_o}}&=\sum_{j=1}^t\delta_{o,j}\\ \frac{\partial{E}}{\partial{\mathbf{b}_i}}&=\sum_{j=1}^t\delta_{i,j}\\ \frac{\partial{E}}{\partial{\mathbf{b}_f}}&=\sum_{j=1}^t\delta_{f,j}\\ \frac{\partial{E}}{\partial{\mathbf{b}_c}}&=\sum_{j=1}^t\delta_{\tilde{c},j}\\ \end{align}$
For the weight gradients of $\mathbf{W}_{fx}, \mathbf{W}_{ix}, \mathbf{W}_{cx}, \mathbf{W}_{ox}$, we only need to compute them directly from the corresponding error terms:

$\begin{align} \frac{\partial{E}}{\partial{W_{ox}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{o,t}}}\frac{\partial{\mathbf{net}_{o,t}}}{\partial{W_{ox}}}\\ &=\delta_{o,t}\mathbf{x}_{t}^T\\\\ \frac{\partial{E}}{\partial{W_{fx}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{f,t}}}\frac{\partial{\mathbf{net}_{f,t}}}{\partial{W_{fx}}}\\ &=\delta_{f,t}\mathbf{x}_{t}^T\\\\ \frac{\partial{E}}{\partial{W_{ix}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{i,t}}}\frac{\partial{\mathbf{net}_{i,t}}}{\partial{W_{ix}}}\\ &=\delta_{i,t}\mathbf{x}_{t}^T\\\\ \frac{\partial{E}}{\partial{W_{cx}}}&=\frac{\partial{E}}{\partial{\mathbf{net}_{\tilde{c},t}}}\frac{\partial{\mathbf{net}_{\tilde{c},t}}}{\partial{W_{cx}}}\\ &=\delta_{\tilde{c},t}\mathbf{x}_{t}^T\\ \end{align}$
These are all the formulas of the LSTM training algorithm.
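Every weight gradient above is an outer product of a gate error term with either $\mathbf{h}_{t-1}$ or $\mathbf{x}_t$, and every bias gradient is the error term itself. The following sketch assumes random stand-in values for the error terms and states; the function name `step_gradients` and the dictionary keys are hypothetical.

```python
import numpy as np


def step_gradients(gate_deltas, h_prev, x_t):
    """Per-time-step gradients for the four gates g in 'o', 'f', 'i', 'c'."""
    grads = {}
    for g, delta in gate_deltas.items():
        grads[f"W_{g}h"] = np.outer(delta, h_prev)   # delta_{g,t} h_{t-1}^T
        grads[f"W_{g}x"] = np.outer(delta, x_t)      # delta_{g,t} x_t^T
        grads[f"b_{g}"] = delta.copy()               # delta_{g,t}
    return grads


rng = np.random.default_rng(0)
d_x, d_h, T = 3, 4, 5
# Stand-in (gate error terms, h_{t-1}, x_t) triples for T time steps.
steps = [
    ({g: rng.normal(size=d_h) for g in "ofic"}, rng.normal(size=d_h), rng.normal(size=d_x))
    for _ in range(T)
]
per_step = [step_gradients(d, h, x) for d, h, x in steps]

# Sum the recurrent-weight and bias terms over all time steps.
total_W_oh = sum(g["W_oh"] for g in per_step)
total_b_o = sum(g["b_o"] for g in per_step)
print(total_W_oh.shape, total_b_o.shape)
```

In a full training loop the error terms would come from the backward pass rather than from a random generator, and the same summation would be repeated for the other three gates.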

GRU

The LSTM described above is the standard form. In practice the LSTM has many variants, and the GRU is one of the most successful. It simplifies the LSTM considerably while keeping performance comparable to the LSTM.

The GRU makes two major changes to the LSTM:

It replaces the input, forget, and output gates with two gates: the update gate $\mathbf{z}_{t}$ and the reset gate $\mathbf{r}_{t}$.
It merges the cell state and the output into a single state $\mathbf{h}$.

The forward computation of the GRU is:

$\begin{align} \mathbf{z}_t&=\sigma(W_z\cdot[\mathbf{h}_{t-1},\mathbf{x}_t])\\ \mathbf{r}_t&=\sigma(W_r\cdot[\mathbf{h}_{t-1},\mathbf{x}_t])\\ \mathbf{\tilde{h}}_t&=\tanh(W\cdot[\mathbf{r}_t\circ\mathbf{h}_{t-1},\mathbf{x}_t])\\ \mathbf{h}_t&=(1-\mathbf{z}_t)\circ\mathbf{h}_{t-1}+\mathbf{z}_t\circ\mathbf{\tilde{h}}_t \end{align}$

[Figure: schematic of the GRU]
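The four GRU formulas can be sketched the same way as the LSTM step. As in the formulas, no bias terms are used. The function name `gru_step`, the dimensions, and the small random weights are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU forward step following the four formulas above (no biases)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                      # update gate
    r_t = sigmoid(W_r @ hx)                                      # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # new state h_t


rng = np.random.default_rng(0)
d_x, d_h = 3, 4
W_z, W_r, W = (rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for _ in range(3))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):   # a toy sequence of 5 inputs
    h = gru_step(x_t, h, W_z, W_r, W)
print(h)
```

Compared with the LSTM sketch, only one state vector is carried between steps and there are three weight matrices instead of four.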
