Deep Learning Notes on Recurrent Neural Networks (7): Viewing LSTM from the Backpropagation Perspective

Introduction

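The previous section described the vanishing-gradient problem that arises in backpropagation through a recurrent neural network, and used it to motivate the long short-term memory network ($\text{Long Short-Term Memory, LSTM}$). This section looks at $\text{LSTM}$ from the backpropagation perspective to see why it can suppress vanishing gradients.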

Review and Supplement: Backpropagation Through Time

Recall from the previous section the backpropagation process for $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}}$ in an $\text{RNN}$:
[Figure: example of the RNN backpropagation process]
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}} = (\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \left\{\prod_{k=1}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right]\right\} \cdot \left\{\prod_{k=2}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}\right\} \cdot x^{(1)}
$$
If we only wanted the gradient of the loss $\mathcal L^{(\mathcal T)}$ at time $\mathcal T$ with respect to the weight component $\mathcal W_{x^{(1)} \Rightarrow h^{(1)}}$ (the red path in the figure), the formula above would be enough. In practice, however, the layer is recurrent, so what we need is the gradient of $\mathcal L^{(\mathcal T)}$ with respect to the whole weight $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$.

This means that at every time step $t\ (t=1,2,\cdots,\mathcal T)$ the gradient of the current $\mathcal W_{x^{(t)}\Rightarrow h^{(t)}}$ is accumulated into the gradient of $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$. The gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}$ of $\mathcal L^{(\mathcal T)}$ with respect to $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$ can therefore be written as follows:
Replace $1$ with $t$ in the formula above.
The braces in the last line below are not a matrix; they are only a convenient way to lay out the sum of $\mathcal T$ terms.
$$
\begin{aligned}
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}
& = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow h^{(\mathcal T)}}} + \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-1)} \Rightarrow h^{(\mathcal T-1)}}} + \cdots + \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}} \\
& = \sum_{t=1}^{\mathcal T} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(t)} \Rightarrow h^{(t)}}} \\
& = \sum_{t=1}^{\mathcal T} \left[(\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \left\{\prod_{k=t}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right]\right\} \cdot \left\{\prod_{k=t+1}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}\right\} \cdot x^{(t)}\right] \\
& = (\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot
\begin{Bmatrix}
\text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(\mathcal T)})\right] \cdot x^{(\mathcal T)} \\
+\prod_{k=\mathcal T-1}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right] \cdot \mathcal W_{h^{(\mathcal T - 1)}\Rightarrow h^{(\mathcal T)}} \cdot x^{(\mathcal T - 1)} \\
\vdots \\
+\prod_{k=1}^{\mathcal T} \text{Diag}\left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right] \cdot \prod_{k=2}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}} \cdot x^{(1)}
\end{Bmatrix}
\end{aligned}
$$
Clearly, the braces in the formula above contain the sum of $\mathcal T$ terms. Inspecting them shows that the further down a term sits (that is, the further back in time it reaches), the more severely its gradient vanishes. In other words, the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}$ is dominated by the first few time steps visited during backpropagation, i.e. those closest to $\mathcal T$. This backpropagation algorithm, whose computational cost is $\mathcal O(\mathcal T)$, is known as backpropagation through time ($\text{Back-Propagation Through Time, BPTT}$).
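The decay of the later terms can be made concrete with a small numerical sketch. It assumes a scalar tanh RNN with hypothetical weights `w_x` and `w_h` (not values from the text) and omits the common factor $(\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}}$, so each printed number is only the part inside the braces for one time step.

```python
import math

def bptt_terms(xs, w_x=0.5, w_h=0.9):
    """Per-time-step terms of the BPTT sum for a scalar tanh RNN (illustrative)."""
    # Forward pass: z_t = w_x * x_t + w_h * h_{t-1}, h_t = tanh(z_t)
    h, zs = 0.0, []
    for x in xs:
        z = w_x * x + w_h * h
        zs.append(z)
        h = math.tanh(z)

    T = len(xs)
    terms = []
    for s in range(1, T + 1):                 # time step s = 1 .. T
        act = 1.0
        for k in range(s, T + 1):             # prod_{k=s}^{T} [1 - tanh^2(z_k)]
            act *= 1.0 - math.tanh(zs[k - 1]) ** 2
        rec = w_h ** (T - s)                  # prod_{k=s+1}^{T} w_h
        terms.append(act * rec * xs[s - 1])
    return terms

terms = bptt_terms([1.0] * 20)
for step in (20, 15, 10, 5, 1):
    print(f"t = {step:2d}: term = {terms[step - 1]:.3e}")
```

With these assumed values the term for $t = \mathcal T$ is the largest and each step further back in time shrinks the term again, which is the behaviour described above.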

Backpropagation in $\text{LSTM}$

Scenario Setup

The feed-forward computation of $\text{LSTM}$ at time $t$ is:
Here $y^{(t)}$ denotes the final output at the current time step, which is then fed into the loss $\mathcal L^{(t)}$ of that time step.
$$
\begin{aligned}
& f^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow f^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)}\Rightarrow f^{(t)}} \cdot h^{(t-1)} + b_f\right] \\
& i^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow i^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}} \cdot h^{(t-1)} + b_i\right] \\
& {\mathcal O}^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow {\mathcal O}^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow {\mathcal O}^{(t)}} \cdot h^{(t-1)} + b_{\mathcal O}\right] \\
& \widetilde{\mathcal C}^{(t)} = \text{Tanh} \left[\mathcal W_{x^{(t)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \cdot h^{(t-1)} + b_{\widetilde{\mathcal C}}\right] \\
& \mathcal C^{(t)} = \mathcal C^{(t-1)} * f^{(t)} + \widetilde{\mathcal C}^{(t)} * i^{(t)} \\
& h^{(t)} = {\mathcal O}^{(t)} * \text{Tanh}(\mathcal C^{(t)}) \\
& y^{(t)} = \mathcal W_{h^{(t)} \Rightarrow y^{(t)}} \cdot h^{(t)} + b_y \quad \mathcal W_{h^{(t)} \Rightarrow y^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal Y}
\end{aligned}
$$
Following the same idea as $\text{BPTT}$ above, the weight parameters in these formulas, such as $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}}$ and $\mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}}$, are the weights that each gate applies at time $t$ to the input $x^{(t)}$ or to the previous hidden state $h^{(t-1)}$. During backpropagation, the gradient from every time step is accumulated into the corresponding weight parameter. We denote these parameters as:
$$
\begin{aligned}
& \mathcal W_{\mathcal X \Rightarrow}:
\begin{cases}
\mathcal W_{x^{(t)} \Rightarrow f^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal F} \quad \mathcal W_{x^{(t)} \Rightarrow i^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal I} \\
\mathcal W_{x^{(t)} \Rightarrow {\mathcal O}^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal O} \quad \mathcal W_{x^{(t)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal C}}
\end{cases} \\
& \mathcal W_{\mathcal H \Rightarrow}:
\begin{cases}
\mathcal W_{h^{(t-1)} \Rightarrow f^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal F} \quad \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal I} \\
\mathcal W_{h^{(t-1)} \Rightarrow {\mathcal O}^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal O} \quad \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal C}}
\end{cases}
\end{aligned}
$$
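The forward computation and the weight sharing across time steps can be sketched in a few lines. This is a minimal scalar illustration, not a reference implementation: the parameter names in `params` (`w_xf`, `w_hf`, and so on) and their values are assumptions chosen only to mirror the symbols above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; `p` maps (hypothetical) parameter names to floats."""
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["b_f"])          # forget gate
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["b_i"])          # input gate
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["b_o"])          # output gate
    c_tilde = math.tanh(p["w_xc"] * x + p["w_hc"] * h_prev + p["b_c"])  # candidate state
    c = c_prev * f + c_tilde * i      # new cell state
    h = o * math.tanh(c)              # new hidden state
    y = p["w_hy"] * h + p["b_y"]      # output passed on to the loss
    return h, c, y

params = {name: 0.5 for name in
          ("w_xf", "w_hf", "w_xi", "w_hi", "w_xo", "w_ho", "w_xc", "w_hc", "w_hy")}
params.update({"b_f": 0.0, "b_i": 0.0, "b_o": 0.0, "b_c": 0.0, "b_y": 0.0})

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:           # the same `params` is reused at every step
    h, c, y = lstm_step(x, h, c, params)
print(h, c, y)
```

Because the loop reuses one `params` dictionary, every time step contributes a gradient to the same parameter, which is why the per-step gradients are accumulated.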

Example: Computing the Gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}$

Suppose the sequence has length $\mathcal T$ and the loss of the output at time $\mathcal T$ is $\mathcal L^{(\mathcal T)}$. We want the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}$ of $\mathcal L^{(\mathcal T)}$ with respect to the weight matrix $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$.
The logic is the same as for the recurrent neural network: the gradients from all time steps are accumulated.
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}} = \sum_{t=1}^{\mathcal T} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(t)} \Rightarrow f^{(t)}}}
$$
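The accumulation rule can be checked numerically. The sketch below assumes a scalar LSTM without biases, a squared-error loss, and a single hypothetical value `w` for every weight other than the forget-gate input weight; it compares the finite-difference gradient with respect to the shared weight against the sum of the finite-difference gradients obtained by perturbing one time step at a time.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def final_loss(w_xf_steps, xs, target=0.3, w=0.5):
    """Squared-error loss at the last step; step t uses w_xf_steps[t]."""
    h = c = 0.0
    for x, w_xf in zip(xs, w_xf_steps):
        f = sigmoid(w_xf * x + w * h)
        i = sigmoid(w * x + w * h)
        o = sigmoid(w * x + w * h)
        c_tilde = math.tanh(w * x + w * h)
        c = c * f + c_tilde * i
        h = o * math.tanh(c)
    y = w * h
    return 0.5 * (y - target) ** 2

def numeric_grad(fn, eps=1e-6):
    """Central difference of fn(delta) at delta = 0."""
    return (fn(eps) - fn(-eps)) / (2 * eps)

xs = [1.0, -0.5, 0.8, 0.3]
T, w_xf = len(xs), 0.7

# Gradient w.r.t. the shared weight: perturb every time step at once.
shared = numeric_grad(lambda d: final_loss([w_xf + d] * T, xs))

# Gradient w.r.t. the copy used at step t only: perturb one step at a time.
per_step = [
    numeric_grad(lambda d, t=t: final_loss(
        [w_xf + (d if s == t else 0.0) for s in range(T)], xs))
    for t in range(T)
]
print(shared, sum(per_step))
```

The two printed values agree up to finite-difference error. The first entry of `per_step` is zero in this setup because the initial cell state is zero, so the forget gate at the first step has nothing to scale.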

Backpropagation: the Gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$ at Time $\mathcal T$

We first look at the gradient at the last time step, $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$.

  • Its gradient propagation path can be written as:
    $$
    \begin{cases}
    \widetilde{f}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \underbrace{\mathcal W_{h^{(\mathcal T -1)} \Rightarrow f^{(\mathcal T)}}\cdot h^{(\mathcal T -1)} + b_f}_{\text{irrelevant to the gradient}} \\
    f^{(\mathcal T)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T)}) \\
    \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \underbrace{\widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)}}_{\text{irrelevant to the gradient}} \\
    m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
    h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\
    y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y
    \end{cases}
    $$
  • The corresponding propagation path is shown in the figure (red arrows):
    [Figure: gradient propagation path]
    The gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$ can therefore be written as:
    $$
    \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}} = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)}\Rightarrow f^{(\mathcal T)}}}
    $$
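Each factor in this chain has a simple closed form in the scalar case, so the product can be checked against a finite difference. The sketch below is an illustration under stated assumptions: a scalar LSTM without biases, a squared-error loss (so that $\partial \mathcal L / \partial y = y - \text{target}$), and hypothetical values for the inputs, the previous states, and the single shared weight `w`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def last_step(w_xf, x=0.8, h_prev=0.2, c_prev=0.6, target=0.3, w=0.5):
    """Final LSTM step plus loss; returns the loss and the chain-rule product."""
    f_lin = w_xf * x + w * h_prev          # \tilde f^{(T)}
    f = sigmoid(f_lin)
    i = sigmoid(w * x + w * h_prev)
    o = sigmoid(w * x + w * h_prev)
    c_tilde = math.tanh(w * x + w * h_prev)
    c = c_prev * f + c_tilde * i
    m = math.tanh(c)
    h = o * m
    y = w * h
    loss = 0.5 * (y - target) ** 2
    factors = [
        y - target,      # dL/dy  (squared-error assumption)
        w,               # dy/dh
        o,               # dh/dm
        1.0 - m ** 2,    # dm/dC
        c_prev,          # dC/df
        f * (1.0 - f),   # df/d f_lin
        x,               # d f_lin / d w_xf
    ]
    return loss, math.prod(factors)

eps = 1e-6
_, analytic = last_step(0.7)
numeric = (last_step(0.7 + eps)[0] - last_step(0.7 - eps)[0]) / (2 * eps)
print(analytic, numeric)
```

The seven entries of `factors` correspond one-to-one, in order, to the seven partial derivatives in the formula above.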
Backpropagation: the Gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T -1)}}}$ at Time $\mathcal T - 1$

At this point we have the gradient with respect to $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$ for the time step closest to $\mathcal T$. What about time $\mathcal T-1$, and how does it differ from time $\mathcal T$? We now write out the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}$ of the loss $\mathcal L^{(\mathcal T)}$ with respect to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$ at time $\mathcal T-1$.
The gradient with respect to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$ must pass through time $\mathcal T$, and its backpropagation consists of several kinds of paths:

  • Through the output gate: $h^{(\mathcal T)}$ propagates directly back to $h^{(\mathcal T - 1)}$, and then from $h^{(\mathcal T - 1)}$ back to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$. Its gradient propagation path can be written as:
    The part hidden by the ellipsis is the same as at time $\mathcal T$, with $\mathcal T$ replaced by $\mathcal T-1$; the same applies to the ellipses below.
    $$
    \begin{cases}
    y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
    h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\
    {\mathcal O}^{(\mathcal T)} = \sigma \left[\underbrace{\mathcal W_{x^{(\mathcal T)} \Rightarrow {\mathcal O}^{(\mathcal T)}} \cdot x^{(\mathcal T)} + b_{\mathcal O}}_{\text{independent of } h^{(\mathcal T-1)}} + \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow {\mathcal O}^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)}\right] \\
    h^{(\mathcal T - 1)} = {\mathcal O}^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
    \quad\vdots
    \end{cases}
    $$
    The gradient along this path is therefore:
    $$
    \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial \mathcal O^{(\mathcal T)}} \cdot \frac{\partial \mathcal O^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T-1)}}{\partial m^{(\mathcal T-1)}} \cdot \frac{\partial m^{(\mathcal T-1)}}{\partial \mathcal C^{(\mathcal T-1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}\right]
    $$
  • Through the cell state: $\mathcal C^{(\mathcal T)}$ propagates back to $\mathcal C^{(\mathcal T -1)}$, then from $\mathcal C^{(\mathcal T - 1)}$ back to $f^{(\mathcal T-1)}$, and on to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$. Its gradient propagation path can be written as:
    $$
    \begin{cases}
    y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
    h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\
    m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
    \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
    \mathcal C^{(\mathcal T-1)} = \mathcal C^{(\mathcal T -2)} * f^{(\mathcal T-1)} + \widetilde{\mathcal C}^{(\mathcal T-1)} * i^{(\mathcal T-1)} \\
    f^{(\mathcal T-1)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T-1)}) \\
    \widetilde{f}^{(\mathcal T-1)} = \mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T-1)}} \cdot x^{(\mathcal T-1)} + \underbrace{\mathcal W_{h^{(\mathcal T -2)} \Rightarrow f^{(\mathcal T-1)}}\cdot h^{(\mathcal T -2)} + b_f}_{\text{irrelevant to the gradient}}
    \end{cases}
    $$
    The gradient along this path is therefore:
    $$
    \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{ \partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial f^{(\mathcal T - 1)}} \cdot \frac{\partial f^{(\mathcal T - 1)}}{\partial \widetilde{f}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}\right]
    $$
  • Starting from the cell state $\mathcal C^{(\mathcal T)}$, through the forget gate back to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. Its propagation path can be written as:
    $$
    \begin{cases}
    y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
    h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\
    m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
    \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
    f^{(\mathcal T)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T)}) \\
    \widetilde{f}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T -1)} \Rightarrow f^{(\mathcal T)}}\cdot h^{(\mathcal T -1)} + b_f \\
    h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
    \quad\vdots
    \end{cases}
    $$
    The gradient along this path is therefore:
    The gradient path from $h^{(\mathcal T - 1)}$ to $\mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T - 1)}}$ is fixed; see case $1$ (its last $5$ factors). It is written as an ellipsis here and below.
    $$
    \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial h^{(\mathcal T-1)}} \cdots\right]
    $$
  • Starting from the cell state $\mathcal C^{(\mathcal T)}$, through the input gate back to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. Its propagation path can be written as:
    New symbol: $\widetilde{i}^{(\mathcal T)}$ denotes the linear part of the input gate.
    $$
    \begin{cases}
    y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
    h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\
    m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
    \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
    i^{(\mathcal T)} = \text{Sigmoid}(\widetilde{i}^{(\mathcal T)}) \\
    \widetilde{i}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)}\Rightarrow i^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T - 1)}\Rightarrow i^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)} + b_i \\
    h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
    \quad\vdots
    \end{cases}
    $$
    The gradient along this path is:
    $$
    \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial i^{(\mathcal T)}} \cdot \frac{\partial i^{(\mathcal T)}}{\partial \widetilde{i}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots\right]
    $$
  • Starting from the cell state $\mathcal C^{(\mathcal T)}$, through the candidate state $\widetilde{\mathcal C}^{(\mathcal T)}$ back to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. Its propagation path can be written as:
    $$
    \begin{cases}
    y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
    h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\
    m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
    \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
    \widetilde{\mathcal C}^{(\mathcal T)} = \text{Tanh} \left[\mathcal W_{x^{(\mathcal T)} \Rightarrow \widetilde{\mathcal C}^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \widetilde{\mathcal C}^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)} + b_{\widetilde{\mathcal C}}\right] \\
    h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
    \quad\vdots
    \end{cases}
    $$
    The gradient along this path is:
    $$
    \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \widetilde{\mathcal C}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots\right]
    $$

We have now found every gradient path for $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}$. Summing the gradients along these paths gives:
As before, the braces only denote the sum of the terms and are not a matrix. Four of the paths pass through the cell state $\mathcal C^{(\mathcal T)}$, so their common factor is pulled out to shorten the expression.
$$
\begin{aligned}
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}} & = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot
\begin{Bmatrix}
\frac{\partial h^{(\mathcal T)}}{\partial \mathcal O^{(\mathcal T)}} \cdot \frac{\partial \mathcal O^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \underbrace{\frac{\partial h^{(\mathcal T-1)}}{\partial m^{(\mathcal T-1)}} \cdot \frac{\partial m^{(\mathcal T-1)}}{\partial \mathcal C^{(\mathcal T-1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}}_{\cdots} \\
\quad \\
+\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot
\begin{bmatrix}
\frac{\partial \mathcal C^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial f^{(\mathcal T - 1)}} \cdot \frac{\partial f^{(\mathcal T - 1)}}{\partial \widetilde{f}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}} \\
+\frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial h^{(\mathcal T-1)}} \cdots \\
+\frac{\partial \mathcal C^{(\mathcal T)}}{\partial i^{(\mathcal T)}} \cdot \frac{\partial i^{(\mathcal T)}}{\partial \widetilde{i}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots \\
+\frac{\partial \mathcal C^{(\mathcal T)}}{\partial \widetilde{\mathcal C}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots
\end{bmatrix}
\end{Bmatrix}
\end{aligned}
$$
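The five-path decomposition can be verified numerically in the scalar case. The sketch below is illustrative only: it assumes a two-step scalar LSTM without biases, a squared-error loss, and hypothetical weights in `P`; only the forget-gate input weight used at step $\mathcal T - 1$ is perturbed, so the finite difference measures exactly the gradient discussed above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar weights, chosen only for illustration.
P = dict(w_xf=0.7, w_hf=0.4, w_xi=0.3, w_hi=-0.6, w_xo=0.5, w_ho=0.8,
         w_xc=0.9, w_hc=-0.2, w_hy=1.1)

def step(x, h_prev, c_prev, w_xf):
    f = sigmoid(w_xf * x + P["w_hf"] * h_prev)
    i = sigmoid(P["w_xi"] * x + P["w_hi"] * h_prev)
    o = sigmoid(P["w_xo"] * x + P["w_ho"] * h_prev)
    g = math.tanh(P["w_xc"] * x + P["w_hc"] * h_prev)   # candidate state
    c = c_prev * f + g * i
    m = math.tanh(c)
    return dict(f=f, i=i, o=o, g=g, c=c, m=m, h=o * m)

def run(w_xf_prev, x1=0.8, x2=-0.5, h0=0.2, c0=0.6, target=0.3):
    """Two steps (T-1, then T); only step T-1 uses w_xf_prev."""
    s1 = step(x1, h0, c0, w_xf_prev)
    s2 = step(x2, s1["h"], s1["c"], P["w_xf"])
    y = P["w_hy"] * s2["h"]
    return 0.5 * (y - target) ** 2, s1, s2, y

def five_paths(x1=0.8, c0=0.6, target=0.3):
    _, s1, s2, y = run(P["w_xf"])
    head = (y - target) * P["w_hy"]                  # dL/dy * dy/dh(T)
    to_w = c0 * s1["f"] * (1 - s1["f"]) * x1         # dC(T-1)/dW via f(T-1)
    h_tail = s1["o"] * (1 - s1["m"] ** 2) * to_w     # dh(T-1)/dW
    via_c = s2["o"] * (1 - s2["m"] ** 2)             # dh(T)/dm(T) * dm(T)/dC(T)
    paths = [
        s2["m"] * s2["o"] * (1 - s2["o"]) * P["w_ho"] * h_tail,          # output gate
        via_c * s2["f"] * to_w,                                          # C(T) -> C(T-1)
        via_c * s1["c"] * s2["f"] * (1 - s2["f"]) * P["w_hf"] * h_tail,  # forget gate
        via_c * s2["g"] * s2["i"] * (1 - s2["i"]) * P["w_hi"] * h_tail,  # input gate
        via_c * s2["i"] * (1 - s2["g"] ** 2) * P["w_hc"] * h_tail,       # candidate state
    ]
    return [head * p for p in paths]

eps = 1e-6
paths = five_paths()
numeric = (run(P["w_xf"] + eps)[0] - run(P["w_xf"] - eps)[0]) / (2 * eps)
print(sum(paths), numeric)
```

The entries of `paths` follow the order of the five cases listed above, and their sum matches the finite-difference value up to numerical error.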

Comparing the Gradients with Respect to $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}}$ at Times $\mathcal T - 2$ and $\mathcal T - 1$

Does the gradient at time $\mathcal T - 2$, $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-2)} \Rightarrow f^{(\mathcal T - 2)}}}$, behave the same way as at time $\mathcal T - 1$? It does not. The reason is that the step $\mathcal T \Rightarrow \mathcal T - 1$ only involves paths that start from $h^{(\mathcal T)}$; that is, every path descends from $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}}$. The step $\mathcal T- 1 \Rightarrow \mathcal T -2$, however, involves not only the paths that start from $h^{(\mathcal T - 1)}$ but also the paths that start from $\mathcal C^{(\mathcal T - 1)}$.
The paths from $h^{(\mathcal T - 1)}$ have the same form as those from $h^{(\mathcal T)}$ and are not repeated here; the paths from $\mathcal C^{(\mathcal T-1)}$ take the forms below. The two sets do overlap in part, but their gradients must be computed separately, because $\mathcal C^{(\mathcal T-1)}$ and $h^{(\mathcal T -1)}$ are different quantities.
$$
\mathcal C^{(\mathcal T -1)} \Rightarrow
\begin{cases}
\begin{aligned}
& \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial \mathcal C^{(\mathcal T-2)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 2)}}{\partial f^{(\mathcal T - 2)}} \cdot \frac{\partial f^{(\mathcal T - 2)}}{\partial \widetilde{f}^{(\mathcal T - 2)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-2)}}{\partial \mathcal W_{x^{(\mathcal T-2)}\Rightarrow f^{(\mathcal T-2)}}} \\
& \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial h^{(\mathcal T-2)}} \cdots \\
& \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial i^{(\mathcal T-1)}} \cdot \frac{\partial i^{(\mathcal T-1)}}{\partial \widetilde{i}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T-1)}}{\partial h^{(\mathcal T - 2)}} \cdots \\
& \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial \widetilde{\mathcal C}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T-1)}}{\partial h^{(\mathcal T - 2)}} \cdots
\end{aligned}
\end{cases}
$$
In total, $\mathcal T \Rightarrow \mathcal T-2$ involves $4 \times 5 + 4 = 24$ paths.
The $+4$ counts the output-gate paths: because this path does not pass through the cell state $\mathcal C^{(t)}$, each time it reaches $h^{(\mathcal T)}$ or $h^{(\mathcal T - 1)}$ there is exactly one path that propagates on to the corresponding $h^{(\mathcal T - 1)}$ or $h^{(\mathcal T - 2)}$.
Likewise, $\mathcal T \Rightarrow \mathcal T - 3$ involves $(4 \times 5) \times 5 + 4 \times 4 = 116$ paths, and so on.
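The two counts above can be reproduced with a few lines of arithmetic. This is only a sketch: the closed form $4 \cdot 5^{k-1} + 4^{k-1}$ is an assumption read off from the two stated cases ($k=2$ and $k=3$), and the function name `path_count` is made up for illustration.

```python
def path_count(k: int) -> int:
    """Paths from step T back to step T-k, for k >= 2.

    Assumption: the pattern of the two stated cases continues, i.e.
    cell-state paths multiply by 5 per step and output-gate paths by 4.
    """
    if k < 2:
        raise ValueError("the pattern is only read off for k >= 2")
    cell_state_paths = 4 * 5 ** (k - 1)
    output_gate_paths = 4 ** (k - 1)
    return cell_state_paths + output_gate_paths


for k in (2, 3):
    # k=2 gives 24 and k=3 gives 116, the two counts stated in the text
    print(k, path_count(k))
```

Under this assumed pattern the count is dominated by the $5^{k-1}$ term, which is the exponential growth the next section relies on.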

Why LSTM can suppress vanishing gradients

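As backpropagation goes deeper, the number of backpropagation paths grows exponentially. Even if the gradient vanishes along some of them, the sheer number of paths can make up for it;
For example, the cell-state gradient $\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}$ can be regulated through the weight parameters of each gate: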
$$
\begin{aligned}
\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}
& = f^{(t)} +
\begin{pmatrix}
\begin{aligned}
& \quad \frac{\partial \mathcal C^{(t)}}{\partial f^{(t)}} \cdot \frac{\partial f^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \\
& + \frac{\partial \mathcal C^{(t)}}{\partial i^{(t)}} \cdot \frac{\partial i^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \\
& + \frac{\partial \mathcal C^{(t)}}{\partial \widetilde{\mathcal C}^{(t)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}}
\end{aligned}
\end{pmatrix} \\
& = f^{(t)} +
\begin{pmatrix}
\mathcal C^{(t-1)} \cdot \left\{\left[\text{Sigmoid}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow f^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\} \\
+ \widetilde{\mathcal C}^{(t)} \cdot \left\{\left[\text{Sigmoid}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\} \\
+ i^{(t)} \cdot \left\{\left[\text{Tanh}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\}
\end{pmatrix}
\end{aligned}
$$
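Notice that every step of backpropagation brings $4$ partial-derivative terms into that step's gradient. Three of them are shaped jointly by the gate parameters: the forget-gate, input-gate, and candidate-state weights of the current step, each scaled by the previous step's output gate $\mathcal O^{(t-1)}$.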
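As a sanity check on the expansion above, the four terms can be compared against a finite-difference derivative. The following is a minimal scalar sketch, not the post's own code. It assumes every quantity is a scalar, omits biases, treats $\mathcal O^{(t-1)}$ as a constant with respect to $\mathcal C^{(t-1)}$ (no peephole connections), and uses made-up weight values; all names (`W_HF`, `cell_step`, and so on) are hypothetical.

```python
import math

# Hypothetical scalar weights (h -> gate and x -> gate); values are arbitrary.
W_HF, W_XF = 0.5, -0.3   # forget gate
W_HI, W_XI = 0.8, 0.1    # input gate
W_HC, W_XC = -0.6, 0.4   # candidate cell state


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gates(c_prev: float, o_prev: float, x: float):
    """Forward pass of one step; returns f, i, c_tilde and dh(t-1)/dC(t-1)."""
    h_prev = o_prev * math.tanh(c_prev)
    f = sigmoid(W_HF * h_prev + W_XF * x)
    i = sigmoid(W_HI * h_prev + W_XI * x)
    c_tilde = math.tanh(W_HC * h_prev + W_XC * x)
    dh_dc = o_prev * (1.0 - math.tanh(c_prev) ** 2)
    return f, i, c_tilde, dh_dc


def cell_step(c_prev: float, o_prev: float, x: float) -> float:
    """C(t) = f * C(t-1) + i * c_tilde."""
    f, i, c_tilde, _ = gates(c_prev, o_prev, x)
    return f * c_prev + i * c_tilde


def analytic_grad(c_prev: float, o_prev: float, x: float) -> float:
    """dC(t)/dC(t-1) as the sum of the four terms in the equation above."""
    f, i, c_tilde, dh_dc = gates(c_prev, o_prev, x)
    term_f = c_prev * f * (1.0 - f) * W_HF * dh_dc
    term_i = c_tilde * i * (1.0 - i) * W_HI * dh_dc
    term_c = i * (1.0 - c_tilde ** 2) * W_HC * dh_dc
    return f + term_f + term_i + term_c


def numeric_grad(c_prev: float, o_prev: float, x: float, eps: float = 1e-6) -> float:
    """Central finite difference of cell_step with respect to c_prev."""
    upper = cell_step(c_prev + eps, o_prev, x)
    lower = cell_step(c_prev - eps, o_prev, x)
    return (upper - lower) / (2.0 * eps)


c_prev, o_prev, x = 0.7, 0.6, 1.2
print(analytic_grad(c_prev, o_prev, x), numeric_grad(c_prev, o_prev, x))
```

If the two printed numbers agree to several decimal places, the four-term decomposition is consistent with the forward definition used in this sketch.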

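This can be understood as follows:

  • Throughout backpropagation, the gate weights at every time step take part in regulating $\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}$. Compared with a recurrent neural network, where a single weight matrix does all the work, this is much more robust;
  • Moreover, in a recurrent neural network the weight matrix is simply multiplied over and over, whereas in $\text{LSTM}$ the terms are summed. Even if the gradient of one gate vanishes at some time step, the remaining gates can adjust accordingly and keep the gradient at that step alive.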
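The contrast between a pure product and a sum of terms can be illustrated numerically. The sketch below multiplies the per-step factors over 50 steps for a scalar tanh RNN and for the cell-state chain of a scalar LSTM. Everything here is an assumption made for illustration: scalar states, hand-picked weights, a forget-gate bias `b_f = 4.0` chosen so that the forget gate stays close to 1, and a synthetic input sequence. It says nothing about what a trained network would do.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rnn_gradient_product(xs, w_h=0.5, w_x=0.5):
    """Product of dh(t)/dh(t-1) = (1 - h(t)^2) * w_h for a scalar tanh RNN."""
    h, product = 0.0, 1.0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        product *= (1.0 - h * h) * w_h
    return product


def lstm_gradient_product(xs, w_h=0.1, w_x=0.5, b_f=4.0):
    """Product of dC(t)/dC(t-1) (forget gate plus three gate terms), scalar LSTM."""
    c, o, product = 0.0, 0.5, 1.0  # c = C(t-1), o = O(t-1); initial values arbitrary
    for x in xs:
        h_prev = o * math.tanh(c)
        dh_dc = o * (1.0 - math.tanh(c) ** 2)
        pre = w_h * h_prev + w_x * x      # shared pre-activation, for brevity only
        f = sigmoid(pre + b_f)            # forget gate, biased towards 1
        i = sigmoid(pre)                  # input gate
        c_tilde = math.tanh(pre)          # candidate cell state
        gate_terms = (c * f * (1.0 - f)
                      + c_tilde * i * (1.0 - i)
                      + i * (1.0 - c_tilde ** 2)) * w_h * dh_dc
        product *= f + gate_terms
        c = f * c + i * c_tilde           # C(t)
        o = sigmoid(pre)                  # O(t), used as O(t-1) in the next step
    return product


xs = [math.sin(k) for k in range(50)]    # synthetic inputs in [-1, 1]
rnn_product = rnn_gradient_product(xs)
lstm_product = lstm_gradient_product(xs)
print(rnn_product, lstm_product)
```

With these settings each RNN factor is at most 0.5 in magnitude, so the product shrinks geometrically, while each LSTM factor is the forget gate plus a small correction and stays near 1. The comparison only covers the hidden-state chain versus the cell-state chain, not the full set of paths counted earlier.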

Personal understanding, 2023/5/27
Does merely adding paths (more paths $\Rightarrow$ more time and space complexity for each of them) completely cancel out vanishing gradients? Undeniably it cancels part of it. More important, though, is what $\frac{\partial\mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}$ $(t=2,\cdots,\mathcal T)$ is made of:

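  • Looking at the equation above, the result contains not only the derivatives of the gates but also the gate values themselves. This means that when gradients are computed during backpropagation, the gates are still regulating the magnitudes, just as they do in the forward pass.
    Recall backpropagation in a fully connected network: there, a 'neuron output' never enters the gradient passed between layers as a multiplicative factor of its own.

    For example, suppose we want the sequence information in the cell state $\mathcal C^{(t)}$ at time $t$ to be identical to the information in $\mathcal C^{(t-1)}$ at time $t-1$ (the ideal case). That is, the information in $x^{(t)}$ should be discarded (in other words, we do not want $x^{(t)}$ to affect $\mathcal C^{(t-1)}$). Then the proportions of $f^{(t)},i^{(t)},\mathcal O^{(t)}$ can be adjusted so that this part of the gradient becomes $\frac{\partial\mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}=1$. During backpropagation the gradient at this step is then the same as the gradient at the previous step, which achieves the goal.
    Corrections and comments are welcome~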
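The ideal case described above, where $\frac{\partial\mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}=1$, can be reproduced in a small scalar sketch by pushing the forget gate towards 1 and the input gate towards 0. The choice of doing this through large biases (`b_f = 20`, `b_i = -20`) is an assumption made for illustration, as are the scalar setting, the weight value, and the function name `cell_state_grad`; the text itself only says that the gate proportions are adjusted.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cell_state_grad(c_prev, o_prev, x, b_f, b_i, w=0.5):
    """dC(t)/dC(t-1) for a scalar LSTM step with hypothetical weights w and biases."""
    h_prev = o_prev * math.tanh(c_prev)
    dh_dc = o_prev * (1.0 - math.tanh(c_prev) ** 2)
    f = sigmoid(w * h_prev + w * x + b_f)   # forget gate
    i = sigmoid(w * h_prev + w * x + b_i)   # input gate
    c_tilde = math.tanh(w * h_prev + w * x)  # candidate cell state
    gate_terms = (c_prev * f * (1.0 - f)
                  + c_tilde * i * (1.0 - i)
                  + i * (1.0 - c_tilde ** 2)) * w * dh_dc
    return f + gate_terms


# Forget gate saturated near 1, input gate saturated near 0: x(t) is ignored.
saturated = cell_state_grad(0.3, 0.9, 1.0, b_f=20.0, b_i=-20.0)
# No bias: the gates sit in their active range and all four terms contribute.
neutral = cell_state_grad(0.3, 0.9, 1.0, b_f=0.0, b_i=0.0)
print(saturated, neutral)
```

In the saturated case the three gate terms are multiplied by factors such as $f(1-f)$ and $i$ that are close to zero, so the gradient is carried almost entirely by $f^{(t)} \approx 1$.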

Related reference:
LSTM如何缓解梯度消失(公式推导) (How LSTM mitigates vanishing gradients, with formula derivation)
