[Deep Learning] From Recurrent Neural Networks (RNN) to LSTM and GRU

Preface

[Deep Learning] From Neural Networks to Convolutional Neural Networks

We previously introduced BP neural networks and convolutional neural networks (CNN), so why do we still need recurrent neural networks (RNN)?

  • The inputs and outputs of BP neural networks and CNNs are independent of one another, but in many real applications the output is related to what came before it.

BP neural networks and CNNs share one assumption: the input is an independent unit with no context, e.g. the input is an image and the network decides whether it shows a dog or a cat. But for sequential inputs with clear context, such as predicting the next frame of a video, the output obviously must depend on previous inputs; in other words, the network needs some form of "memory". To give the network this memory, a neural network with a special structure, the recurrent neural network (Recurrent Neural Network), came into being.

  • RNNs introduce the notion of "memory"; "recurrent" means that every element performs the same task, but the output depends on both the input and the "memory".

Typical RNN applications: natural language processing, machine translation, speech recognition, and so on.

I. RNN (Recurrent Neural Network)

  Recurrent neural networks are a family of neural networks for processing sequential data. Just as convolutional neural networks are specialized for grid-like data (such as an image), recurrent neural networks are specialized for processing a sequence $x^{(1)}, \dots, x^{(T)}$.

The structure of an RNN is shown below:

[Figure: RNN network structure]

Compared with a convolutional neural network, the structure of a recurrent neural network is fairly simple: it usually contains only an input layer, a hidden layer, and an output layer — counting the input and output layers, five layers at most.

Unrolling the sequence over time gives the RNN structure shown in the figure below:

[Figure: RNN unrolled over time]

The network's input at time $t$, $x_t$, is an $n$-dimensional vector, just like the input of the BP neural network introduced earlier. The difference is that the input to a recurrent network is an entire sequence, i.e. $x = [x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T]$. For a language model, each $x_t$ is a word vector and the whole sequence represents a sentence.

  • $h_t$ denotes the linear (pre-activation) value of the hidden neurons at time $t$
  • $s_t$ denotes the hidden state at time $t$, i.e. the "memory"
  • $o_t$ denotes the output at time $t$
  • $U$ denotes the weights from the input layer to the hidden layer
  • $W$ denotes the hidden-to-hidden weights; it is the network's memory controller, responsible for managing the memory
  • $V$ denotes the weights from the hidden layer to the output layer
1. Training RNNs: BPTT

  RNNs are trained in the same way as CNNs/ANNs, using the error back-propagation (BP) algorithm.

The differences are:

  • The parameters $U$, $V$, $W$ of an RNN are shared, and in stochastic gradient descent the output at each step depends not only on the current step's network but also on the network state of several previous steps. This modified BP algorithm is called Backpropagation Through Time (BPTT).
  • Like BP, BPTT can suffer from vanishing and exploding gradients when training over many steps (long-term dependencies, i.e. the current output depends on a long preceding sequence, typically more than 10 steps).
  • BPTT follows the same idea as BP — compute partial derivatives — except that it must also account for the effect of time on each step.
2. RNN Forward Propagation

At time $t=1$, $U$, $V$, $W$ have been randomly initialized and $s_0$ is usually initialized to 0; the computation is:

  • $h_1 = U x_1 + W s_0$
  • $s_1 = f(h_1)$
  • $o_1 = g(V s_1)$

At time $t=2$, the state $s_1$, acting as the memory from time step 1, takes part in the prediction at the next time step:

  • $h_2 = U x_2 + W s_1$
  • $s_2 = f(h_2)$
  • $o_2 = g(V s_2)$

By induction:

  • $h_t = U x_t + W s_{t-1}$
  • $s_t = f(h_t)$
  • $o_t = g(V s_t)$

where $f$ can be an activation function such as tanh, ReLU, or sigmoid, and $g$ is usually softmax but can be something else. A small code sketch of this forward pass follows the notes below.

  • Note that the recurrent network's "memory" comes from $W$, which summarizes the past input states and feeds that summary in as an auxiliary input at the next step.
  • The hidden state can be read as $h = f(\text{current input} + \text{summary of past memory})$.
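
The following is a minimal NumPy sketch of the recurrence above. The tensor shapes, the choice of tanh for $f$ and softmax for $g$, and the zero initialization of $s_0$ are illustrative assumptions rather than anything fixed by the derivation.

```python
import numpy as np

def rnn_forward(xs, U, W, V, s0=None):
    """Vanilla RNN: h_t = U x_t + W s_{t-1}, s_t = f(h_t), o_t = g(V s_t)."""
    s = np.zeros(W.shape[0]) if s0 is None else s0   # s_0, the initial "memory"
    states, outputs = [], []
    for x in xs:                                     # one step per element of the sequence
        h = U @ x + W @ s                            # h_t = U x_t + W s_{t-1}
        s = np.tanh(h)                               # s_t = f(h_t), here f = tanh
        z = V @ s
        o = np.exp(z - z.max()); o /= o.sum()        # o_t = g(V s_t), here g = softmax
        states.append(s); outputs.append(o)
    return states, outputs

# Tiny usage example with made-up sizes
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 3, 5, 6
U = rng.normal(size=(n_hidden, n_in))
W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_out, n_hidden))
states, outputs = rnn_forward([rng.normal(size=n_in) for _ in range(T)], U, W, V)
```
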
3. RNN Backward Propagation

  As in the error back-propagation used by BP neural networks, we take the total error at the output layer, compute the gradients $\nabla U$, $\nabla V$, $\nabla W$ with respect to each weight, and then update the weights by gradient descent.

  At every time step $t$, the RNN's output $o_t$ incurs some error $e_t$; the loss function can be cross-entropy, squared error, and so on. The total error is then $E = \sum_t e_t$, and our goal is to compute:

$$E = \sum_t e_t$$

$$\nabla U = \frac{\partial E}{\partial U} = \sum_t\frac{\partial e_t}{\partial U}$$

$$\nabla V = \frac{\partial E}{\partial V} = \sum_t\frac{\partial e_t}{\partial V}$$

$$\nabla W = \frac{\partial E}{\partial W} = \sum_t\frac{\partial e_t}{\partial W}$$

Below we take $t=3$ as an example.

Suppose we use squared error and the ground truth is $y_t$; then:

$$e_3 = \frac{1}{2}(o_3-y_3)^2$$

$$o_3 = g(V s_3)$$

$$e_3 = \frac{1}{2}\big(g(V s_3)-y_3\big)^2$$

$$s_3 = f(U x_3 + W s_2)$$

$$e_3 = \frac{1}{2}\big(g(V f(U x_3 + W s_2))-y_3\big)^2$$

Solving for the partial derivative with respect to $W$:

The part of the expression above that involves $W$ is $W s_2$, which is clearly a composite function, so we differentiate it with the chain rule:

$$\frac{\partial e_3}{\partial W} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial W}$$

We now evaluate each factor in turn (assuming the squared-error loss):

$$e_3 = \frac{1}{2}(o_3-y_3)^2$$

$$\frac{\partial e_3}{\partial o_3} = o_3 - y_3$$

$$o_3 = g(V s_3)$$

$$\frac{\partial o_3}{\partial s_3} = g' V$$

where $g'$ denotes the derivative of the function $g$.


The first two factors are simple; the important one is the third.

From the recurrence

$$s_t = f(U x_t + W s_{t-1})$$

we see that $s_3$ depends not only on $W$ but also on the previous state $s_2$.

Expanding $s_3$ directly gives:

$$\frac{\partial s_3}{\partial W}=\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}$$

  • Here $\frac{\partial s_3^+}{\partial W}$ denotes the "direct" derivative, i.e. differentiating with everything other than $W$ treated as a constant (no composite differentiation).
  • $\frac{\partial s_2}{\partial W}$ denotes the full composite derivative.

Expanding $s_2$ in the same way:

$$\frac{\partial s_2}{\partial W}=\frac{\partial s_2}{\partial s_2}\frac{\partial s_2^+}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}$$

And expanding $s_1$:

$$\frac{\partial s_1}{\partial W}=\frac{\partial s_1}{\partial s_1}\frac{\partial s_1^+}{\partial W} + \frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W}$$

Substituting the last two expansions into the first gives:

$$\frac{\partial s_3}{\partial W}=\sum_{k=0}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$

Finally:

$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$

An alternative view (assuming for simplicity that we ignore $f$):

$$s_t = U x_t + W s_{t-1}$$

$$s_3 = U x_3 + W s_2$$

$$\frac{\partial s_3}{\partial W} = s_2 + W\frac{\partial s_2}{\partial W}$$

$$= s_2 + W s_1 + W W\frac{\partial s_1}{\partial W}$$

  • $s_2 = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}$
  • where $\frac{\partial s_3}{\partial s_3}=1$ and $\frac{\partial s_3^+}{\partial W}=s_2$ denotes differentiating $s_3$ with respect to $W$ directly, without composite differentiation

$$s_2 = U x_2 + W s_1$$

  • $W s_1 = \frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}$
  • where $\frac{\partial s_3}{\partial s_2}=W$ and $\frac{\partial s_2^+}{\partial W}=s_1$

$$s_1 = U x_1 + W s_0$$

$$WW\frac{\partial s_1}{\partial W}=\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}$$

Finally:

$$\frac{\partial s_3}{\partial W} =\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}+\frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}+\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\sum_{k=1}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$

[Figure: dependency graph of the unrolled RNN used for the chain rule]

$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$

From the figure above, by the chain rule:

$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\Big(\prod_{j=k+1}^3\frac{\partial s_j}{\partial s_{j-1}}\Big)\frac{\partial s_k^+}{\partial W}$$
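
For readers who prefer code, here is a minimal NumPy sketch of BPTT for $W$. To keep it short it assumes $f=\tanh$, takes $g$ as the identity with a squared-error loss on the last output only, and starts from $s_0=0$ (so the $k=0$ direct term vanishes); it is an illustration of the summation above, not a general implementation.

```python
import numpy as np

def bptt_grad_W(xs, y_T, U, W, V):
    """de_T/dW for a vanilla RNN, accumulated backward through time.

    Simplifications (for illustration only): f = tanh, g = identity,
    squared-error loss e_T = 0.5 * ||o_T - y_T||^2 on the last step, s_0 = 0.
    """
    # forward pass, storing s_0 .. s_T
    states = [np.zeros(W.shape[0])]
    for x in xs:
        states.append(np.tanh(U @ x + W @ states[-1]))
    o_T = V @ states[-1]

    # backward pass: ds carries the running product of ds_j/ds_{j-1}
    dW = np.zeros_like(W)
    ds = V.T @ (o_T - y_T)                    # de_T/ds_T
    for k in range(len(xs), 0, -1):
        delta = (1.0 - states[k] ** 2) * ds   # multiply by f'(h_k) for tanh
        dW += np.outer(delta, states[k - 1])  # the "direct" term ds_k^+/dW
        ds = W.T @ delta                      # step back: de_T/ds_{k-1}
    return dW
```
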

Solving for the partial derivative with respect to $U$ (similar to $W$):

$$\frac{\partial e_3}{\partial U} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial U}$$

Let $a_t = U x_t$ and $b_t = W s_{t-1}$, so that

$$s_t = f(a_t + b_t)$$

For the third factor, using

$$s_3 = f(U x_3 + W s_2)$$

$$\frac{\partial s_3}{\partial U}=f' \times \Big(\frac{\partial U x_3}{\partial U}+W\frac{\partial s_2}{\partial U}\Big)$$

$$=f' \times \Big(\frac{\partial U x_3}{\partial U}+W f' \times \big(\frac{\partial U x_2}{\partial U}+W\frac{\partial s_1}{\partial U}\big)\Big)$$

$$=f' \times \Big(\frac{\partial U x_3}{\partial U}+W f' \times \big(\frac{\partial U x_2}{\partial U}+W f' \times \big(\frac{\partial U x_1}{\partial U}+W\frac{\partial s_0}{\partial U}\big)\big)\Big)$$

$$=f' \times \Bigg(\frac{\partial U x_3}{\partial U}+W f' \times \bigg(\frac{\partial U x_2}{\partial U}+W f' \times \Big(\frac{\partial U x_1}{\partial U}+W f' \times \big(\frac{\partial U x_0}{\partial U}\big)\Big)\bigg)\Bigg)$$

$$=f' \times \frac{\partial U x_3}{\partial U}+W(f')^2 \times \frac{\partial U x_2}{\partial U}+W^2(f')^3 \times \frac{\partial U x_1}{\partial U}+W^3(f')^4 \times \frac{\partial U x_0}{\partial U}$$

$$=\sum_{k=0}^3 (f')^{4-k}\frac{\partial (W^{3-k}a_k)}{\partial U}$$

$$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}(f')^{4-k}$$

I am not sure this result is correct; I would be grateful if readers who know better could point out any mistakes.

Ignoring $f$:

$$s_t = U x_t + W s_{t-1}$$

$$s_3 = U x_3 + W\Big(U x_2 + W\big(U x_1 + W U x_0\big)\Big)$$

$$= U x_3 + W U x_2 + W^2 U x_1 + W^3 U x_0$$

$$s_3 = a_3 + W a_2 + W^2 a_1 + W^3 a_0$$

$$\frac{\partial s_3}{\partial U} =\sum_{k=0}^3 \frac{\partial (W^{3-k}a_k)}{\partial U}$$

$$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}$$

Solving for the partial derivative with respect to $V$:

Since $V$ is related only to the output $o_t$:

$$\frac{\partial e_3}{\partial V} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial V}$$

4. Limitations of the RNN

  From the derivation above: by the time we reach, say, $t=100$, the value from $t=0$ has been multiplied by such high powers of $W$ that the network may well have forgotten the information from $t=0$. We call this the RNN's vanishing gradient. It is not vanishing in the literal sense — the gradient is an accumulated sum and cannot be exactly zero — rather, the gradient contributed at some time step becomes so small that the content of earlier time steps is forgotten.

  To overcome the vanishing-gradient problem, the LSTM and GRU models were later proposed. Because they both store "memory" in a special way, memories that previously carried large gradients are not wiped out immediately as in a plain RNN, so the vanishing-gradient problem is alleviated to some extent.

Another simple trick, used against exploding gradients, is gradient clipping: whenever a computed gradient exceeds a threshold $c$ or falls below $-c$, it is set to $c$ or $-c$.

The figure below shows the error surface of an RNN:
[Figure: RNN error surface]

As the figure shows, the error surface of an RNN is either very steep or very flat. Without any countermeasure, if a parameter update happens to land on a steep region, the gradient becomes very large, so the update is also very large and the optimization easily oscillates. With gradient clipping, even if you are unlucky enough to hit a steep region the gradient cannot blow up, because it is capped at the threshold $c$.
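
A sketch of element-wise clipping at a threshold $c$ might look as follows (the threshold value and the clip-by-value variant, as opposed to clipping by global norm, are illustrative choices):

```python
import numpy as np

def clip_gradients(grads, c=5.0):
    """Clip every gradient entry into the interval [-c, c]."""
    return [np.clip(g, -c, c) for g in grads]

# e.g. applied to the RNN gradients just before the SGD update:
# dU, dW, dV = clip_gradients([dU, dW, dV], c=5.0)
```
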

II. LSTM (Long Short-Term Memory)

  RNNs have trouble with long-term dependencies and can suffer from vanishing and exploding gradients. As its name suggests, the LSTM is especially well suited to problems that require long-range dependencies. Compared with the RNN:

  • the LSTM redesigns the "memory cell";
  • information that should be recorded keeps being passed on, while information that should not is cut off.

The figure below shows the unrolled structure of a recurrent network:
[Figure: unrolled recurrent network]

The box labeled A represents the "memory cell".

The RNN's "memory cell" looks like this:

[Figure: RNN memory cell]

It is just a simple non-linear mapping.

The LSTM's "memory cell" looks like this:

[Figure: LSTM memory cell]

Three gates are added to control the "memory cell".

1. The Memory Cell

  The cell state is like a conveyor belt: it runs straight along the entire chain with only a few minor linear interactions, so it is easy for information to flow along it unchanged.

[Figure: LSTM cell state]

How does the LSTM control the "cell state"?

  • The LSTM removes information from, or adds information to, the cell state through gate structures.
  • The LSTM has three main gates that control the cell state:
  • the forget gate, the input (information-adding) gate, and the output gate.
2. The Forget Gate

[Figure: forget gate]

  • The previous time step's output and the current input are passed through a sigmoid, giving a value between 0 and 1.
  • This value describes how much of each component is allowed through.
  • If the value is 0, multiplying it with $C_{t-1}$ gives 0, meaning "let nothing through".
  • If the value is 1, multiplying it with $C_{t-1}$ leaves $C_{t-1}$ unchanged, meaning "let everything through".

The forget gate decides what information to discard from the "cell state".
For example, in a language model the cell state may carry gender information ("he" or "she"); when a new pronoun appears, the old information can be forgotten.

3. The Input (Information-Adding) Gate

[Figure: input gate]

  • Decides what new information to put into the "cell state".
  • A sigmoid layer decides which values to update.
  • A tanh layer creates a new candidate vector $\widetilde{C}_t$ in preparation for the state update.

[Figure: updating the cell state]
After the forget gate and the input gate, we know what to delete and what to add, so the "cell state" can be updated:

  • Update $C_{t-1}$ to $C_t$.
  • Multiply the old state by $f_t$, dropping the information we decided to discard.
  • Add the new candidate values $i_t * \widetilde{C}_t$ to obtain the final updated "cell state".
4. The Output Gate

[Figure: output gate]
The output gate produces the output based on the "cell state":

  • First a sigmoid layer determines which part of the cell state will be output.
  • The cell state is then passed through tanh, giving a value between -1 and 1, which is multiplied by the sigmoid gate's output so that only the parts we decided on are output.
5. LSTM Forward Propagation

$$f_t = \sigma(W_f \cdot[h_{t-1}, x_t] + b_f)$$

(write $x_f$ for $[h_{t-1}, x_t]$)

$$i_t = \sigma(W_i \cdot[h_{t-1}, x_t] + b_i)$$

(write $x_i$ for $[h_{t-1}, x_t]$)

$$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1},x_t]+b_C)$$

(write $x_C$ for $[h_{t-1}, x_t]$)

$$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$$

$$o_t=\sigma(W_o\cdot [h_{t-1}, x_t] + b_o)$$

(write $x_o$ for $[h_{t-1}, x_t]$)

$$h_t=o_t * \tanh(C_t)$$

$$\hat{y}_t=W_y \cdot h_t + b_y$$
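
These equations map almost line for line onto code. Below is a minimal NumPy sketch of a single LSTM step; the sigmoid helper, the explicit concatenation of $[h_{t-1}, x_t]$, and the weight shapes are assumptions made only to keep it runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, WC, bC, Wo, bo, Wy, by):
    """One LSTM time step, following the forward equations above.

    Each gate weight has shape (hidden, hidden + n_in); x_t is (n_in,),
    h_prev and C_prev are (hidden,).
    """
    X = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ X + bf)               # forget gate
    i_t = sigmoid(Wi @ X + bi)               # input gate
    C_tilde = np.tanh(WC @ X + bC)           # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # cell-state update
    o_t = sigmoid(Wo @ X + bo)               # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    y_hat = Wy @ h_t + by                    # linear readout
    return h_t, C_t, y_hat
```
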

6. LSTM Backward Propagation

Using squared error:

$$E = \sum_{t=0}^T E_t$$

$$E_t = \frac{1}{2} (\hat{y}_t - y_t)^2$$

$$\frac{\partial E}{\partial W_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot h_t$$

$$\frac{\partial E}{\partial b_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot 1$$

Since $W_f$, $W_i$, $W_C$, $W_o$ are all related to $h_t$ and $C_t$, their derivatives can all be written as chain rules through $h_t$ and $C_t$.

(1) First find the derivatives of $E$ with respect to $h_t$ and $C_t$

[Figure: gradient paths flowing into $h_t$ and $C_t$]

As the figure shows, both $h_t$ and $C_t$ lie on two paths, so each derivative consists of two parts:

  • the derivative of the error at the current time step, and
  • the derivative of the accumulated error from time $t+1$ through $T$.

$$\frac{\partial E}{\partial h_t} =\frac{\partial E_t}{\partial h_t} + \frac{\partial (\sum_{k=t+1}^T E_k)}{\partial h_t}$$

$$\frac{\partial E}{\partial C_t} =\frac{\partial E_t}{\partial C_t} + \frac{\partial (\sum_{k=t+1}^T E_k)}{\partial C_t}$$

$$\frac{\partial E_t}{\partial h_t} =\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T$$

$$\frac{\partial E_t}{\partial C_t}=\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial C_t}= \frac{\partial E_t}{\partial h_t} \cdot o_t \cdot (1-\tanh^2(C_t))=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot o_t \cdot (1-\tanh^2(C_t))$$

The following two quantities cannot be computed yet, so for now we just give them names:

$$\frac{\partial (\sum_{k=t+1}^T E_k)}{\partial h_t}=dh_{next}$$

$$\frac{\partial (\sum_{k=t+1}^T E_k)}{\partial C_t}=dC_{next}$$

(2) Partial derivative with respect to $W_o$

$$\frac{\partial E}{\partial W_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial W_o}$$

$$\frac{\partial h_t}{\partial o_t}=\tanh(C_t)$$

$$\frac{\partial o_t}{\partial W_o} = o_t \cdot (1-o_t) \cdot x_o^T$$

$$\frac{\partial E}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t) \cdot x_o^T$$

(3) Partial derivative with respect to $b_o$

$$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial b_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial b_o}$$

$$\frac{\partial h_t}{\partial o_t} = \tanh(C_t)$$

$$\frac{\partial o_t}{\partial b_o}=o_t(1-o_t)$$

$$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t)$$

(4) Partial derivative with respect to $x_o$

$$\frac{\partial E}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial x_o}$$

$$\frac{\partial o_t}{\partial x_o}=o_t(1-o_t)\cdot W_o^T$$

(5) Partial derivative with respect to $W_C$

$$\frac{\partial E}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial W_C}$$

$$\frac{\partial C_t}{\partial \widetilde{C}_t}=i_t$$

$$\frac{\partial \widetilde{C}_t}{\partial W_C}=(1-\widetilde{C}_t^2)\cdot x_C^T$$

(6) Partial derivative with respect to $b_C$

$$\frac{\partial E}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial b_C}$$

$$\frac{\partial \widetilde{C}_t}{\partial b_C}=(1-\widetilde{C}_t^2)\cdot 1$$

(7) Partial derivative with respect to $x_C$

$$\frac{\partial E}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial x_C}$$

$$\frac{\partial \widetilde{C}_t}{\partial x_C}=(1-\widetilde{C}_t^2)\cdot W_C^T$$

(8) Partial derivatives with respect to $W_i$, $b_i$, $x_i$

$$\frac{\partial E}{\partial W_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial W_i}$$

$$\frac{\partial E}{\partial b_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial b_i}$$

$$\frac{\partial E}{\partial x_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial x_i}$$

$$\frac{\partial C_t}{\partial i_t}=\widetilde{C}_t$$

$$\frac{\partial i_t}{\partial W_i}=i_t\cdot (1-i_t) \cdot x_i^T$$

$$\frac{\partial i_t}{\partial b_i}= i_t\cdot (1-i_t) \cdot 1$$

$$\frac{\partial i_t}{\partial x_i}=i_t\cdot (1-i_t) \cdot W_i^T$$

(9) Partial derivatives with respect to $W_f$, $b_f$, $x_f$

$$\frac{\partial E}{\partial W_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial W_f}$$

$$\frac{\partial E}{\partial b_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial b_f}$$

$$\frac{\partial E}{\partial x_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial x_f}$$

$$\frac{\partial C_t}{\partial f_t}=C_{t-1}$$

$$\frac{\partial f_t}{\partial W_f}=f_t\cdot (1-f_t) \cdot x_f^T$$

$$\frac{\partial f_t}{\partial b_f}= f_t\cdot (1-f_t)\cdot 1$$

$$\frac{\partial f_t}{\partial x_f}=f_t\cdot (1-f_t)\cdot W_f^T$$

(10) Partial derivative with respect to $X$

$$\frac{\partial E}{\partial X}=\frac{\partial E}{\partial x_i}+\frac{\partial E}{\partial x_f}+\frac{\partial E}{\partial x_o}+\frac{\partial E}{\partial x_C}$$

Since $X = [h_{t-1}, x]$ is the concatenation of $h_{t-1}$ and $x$, we have:

$$dh_{next}=\frac{\partial E}{\partial X}[:,:H] \quad (\text{the first } H \text{ columns})$$

(11) Computing $dC_{next}$

$$\frac{\partial (\sum_{k=t}^T E_k)}{\partial C_{t-1}}=\frac{\partial (\sum_{k=t}^T E_k)}{\partial C_{t}}\cdot \frac{\partial C_t}{\partial C_{t-1}}=\frac{\partial E}{\partial C_t}\frac{\partial C_t}{\partial C_{t-1}}=\frac{\partial E}{\partial C_t}\cdot f_t$$

The updates run from the last time step backward, so:

at the final time step, $dh_{next} = 0$ and $dC_{next} = 0$.
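
Putting steps (1)-(11) together, a backward loop over the sequence might be sketched as below. The cache layout (what the forward pass saved at each step), the parameter dictionary, and the per-vector shapes are assumptions made for illustration; this is a sketch of how $dh_{next}$ and $dC_{next}$ are threaded from the last step back to the first, not a verified implementation.

```python
import numpy as np

def lstm_backward(caches, dy_list, params, H):
    """Walk t = T .. 0, carrying dh_next and dC_next between steps.

    caches[t] holds values saved by the forward pass at step t:
    X = [h_{t-1}, x_t], f, i, C_tilde, C, C_prev, o, h.
    dy_list[t] is dE_t/dy_hat_t; params maps names (Wf, bf, ..., Wy, by) to arrays.
    """
    Wf, Wi, WC, Wo, Wy = (params[k] for k in ("Wf", "Wi", "WC", "Wo", "Wy"))
    grads = {k: np.zeros_like(v) for k, v in params.items()}
    dh_next = np.zeros(H)                          # zero at the last time step
    dC_next = np.zeros(H)
    for t in reversed(range(len(caches))):
        c, dy = caches[t], dy_list[t]
        grads["Wy"] += np.outer(dy, c["h"]); grads["by"] += dy
        # total gradient reaching h_t and C_t: current step + later steps
        dh = Wy.T @ dy + dh_next
        dC = dh * c["o"] * (1.0 - np.tanh(c["C"]) ** 2) + dC_next
        # pre-activation gradients for each gate (sigmoid'/tanh' factors)
        do = dh * np.tanh(c["C"]) * c["o"] * (1.0 - c["o"])
        dct = dC * c["i"] * (1.0 - c["C_tilde"] ** 2)
        di = dC * c["C_tilde"] * c["i"] * (1.0 - c["i"])
        df = dC * c["C_prev"] * c["f"] * (1.0 - c["f"])
        for name, d in (("Wf", df), ("Wi", di), ("WC", dct), ("Wo", do)):
            grads[name] += np.outer(d, c["X"])     # weight gradient
            grads["b" + name[1]] += d              # bias gradient
        # gradient w.r.t. the concatenated input X = [h_{t-1}, x_t]
        dX = Wf.T @ df + Wi.T @ di + WC.T @ dct + Wo.T @ do
        dh_next = dX[:H]                           # first H entries feed h_{t-1}
        dC_next = dC * c["f"]                      # dC_{t-1} = dC_t * f_t, as derived above
    return grads
```
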

III. LSTM Variants

1. Variant 1

[Figure: LSTM variant 1]

  • Every gate layer also receives the cell state as an input.
2. Variant 2

[Figure: LSTM variant 2]

  • Couple the forget gate and the input gate (the first and second gates): instead of deciding separately what to forget and what new information to add, the two decisions are made together.
3. Variant 3 (GRU)

[Figure: GRU]

  • The GRU was proposed in 2014.
  • It merges the forget gate and the input gate into a single update gate.
  • It also merges the cell state and the hidden state.
  • Its structure is simpler than that of the LSTM. A rough sketch of a single GRU step is given below.
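
The post does not give the GRU equations, so the sketch below uses the commonly cited update-gate/reset-gate formulation (biases omitted for brevity); treat it as a standard reference formulation rather than something derived here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: a single update gate z_t replaces the LSTM's
    forget/input pair, and the cell state is merged into h_t."""
    X = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ X)                                         # update gate
    r_t = sigmoid(Wr @ X)                                         # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                    # merged state update
    return h_t
```
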

IV. Summary

  • This post introduced the network structures of the RNN and the LSTM together with the derivations of forward and backward propagation. It looks complicated, especially with all the math, but the core is just the chain rule for derivatives; once the relationships between the variables are clear and you work through them step by step, it becomes much easier. I am still learning and do not yet fully understand some parts myself, so there are bound to be shortcomings — feedback is very welcome.
  • In practice, ready-made libraries implement all of this for us, but a good grasp of the underlying principles is also an effective way to use them more efficiently.
  • Several LSTM variants were mentioned at the end; there are others that are not covered in detail here — I will introduce them once I have studied them.