Derivation of the Recurrent Neural Network (RNN) Equations

Introduction to RNNs

An RNN (Recurrent Neural Network) is a neural network that takes a sequence as input. The output at the current time step depends not only on the current input but also on the inputs at earlier time steps, which gives the network a form of memory. RNNs are widely used in video processing, language modeling, image processing, and other areas. The network structure is shown in the figure below:

In the figure above, $x^{(t)}$ denotes the input at time $t$, $U$ is the weight matrix from the input to the hidden layer, $V$ is the weight matrix from the hidden layer to the output, and $W$ is the weight matrix connecting the hidden layers of two adjacent time steps. $U$, $V$, and $W$ are shared across all time steps.

Forward Computation

The forward computation of an RNN is fairly simple. Before walking through it, we first fix some notation. The figure below is a slightly more detailed RNN diagram.

Let us assume the input and output layers each have $C = 5$ neurons and the hidden layer has $H = 4$ neurons. Then:

$$
x^{(t)} \in R^{C \times 1} \\
U \in R^{H \times C} \\
s^{(t)} \in R^{H \times 1} \\
V \in R^{C \times H} \\
W \in R^{H \times H} \\
o^{(t)} \in R^{C \times 1} \\
y^{(t)} \in R^{C \times 1}
$$
where:
$$
\begin{aligned}
s^{(t)} &= \tanh\!\bigl( Ux^{(t)} + Ws^{(t-1)} + b_s \bigr) &&(1.1)\\
a^{(t)} &= Ux^{(t)} + Ws^{(t-1)} + b_s &&(1.2)\\
o^{(t)} &= \mathrm{softmax}\!\bigl( Vs^{(t)} + b_o \bigr) = \bigl[\, o^{(t)}_1, o^{(t)}_2, o^{(t)}_3, o^{(t)}_4, o^{(t)}_5 \,\bigr] &&(1.3)\\
z^{(t)} &= Vs^{(t)} + b_o = \bigl[\, z^{(t)}_1, z^{(t)}_2, z^{(t)}_3, z^{(t)}_4, z^{(t)}_5 \,\bigr] &&(1.4)
\end{aligned}
$$

$$
o^{(t)}_i = \frac{ \exp\!\bigl(z^{(t)}_i\bigr) }{ \sum_j \exp\!\bigl(z^{(t)}_j\bigr) } \qquad (1.5)
$$
Here $o^{(t)}_i$ is the probability that $x^{(t)}$ belongs to class $i$, and $\sum_j o^{(t)}_j = 1$. With these definitions and the forward pass in place, we can use backpropagation to update the parameters. Before the backward computation we need to fix a loss function. The loss of an RNN is the sum of the errors over all time steps, and we want the total error to be as small as possible. For multi-class models we typically pair a softmax layer with the negative log-likelihood loss.

$$
\begin{aligned}
E &= \sum_t E_t &&(1.6)\\
E_t &= -\sum_{i=1}^{C} y_i^{(t)} \log o_i^{(t)} = -\log o^{(t)}_{y^{(t)}=1} &&(1.7)
\end{aligned}
$$

Here $o^{(t)}_{y^{(t)}=1}$ denotes the entry of $o^{(t)}$ at the index where $y^{(t)}$ equals 1. Put simply, if the 2nd element of $y^{(t)}$ is 1, then $o^{(t)}_{y^{(t)}=1} = o^{(t)}_2$. Note that $y^{(t)}$ is a one-hot vector: exactly one of its entries is 1 and all the others are 0.
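To make the forward pass concrete, below is a minimal NumPy sketch that mirrors equations (1.1)-(1.7). The parameter names ($U$, $V$, $W$, $b_s$, $b_o$) follow the text; the random initialization, the toy sizes $C=5$, $H=4$, and the sample one-hot sequence are assumptions made purely for illustration.

```python
import numpy as np

np.random.seed(0)
C, H, T = 5, 4, 3                      # input/output size, hidden size, sequence length (toy values)

# Parameters with the shapes defined above (random initialization, for illustration only)
U = np.random.randn(H, C) * 0.1        # input  -> hidden
W = np.random.randn(H, H) * 0.1        # hidden -> hidden (recurrent)
V = np.random.randn(C, H) * 0.1        # hidden -> output
b_s, b_o = np.zeros(H), np.zeros(C)

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def forward(xs, ys):
    """xs, ys: length-T lists of one-hot vectors of size C.
    Returns hidden states s, outputs o, and the total loss E = sum_t E_t."""
    s = {-1: np.zeros(H)}              # s^{(-1)} = 0
    o, E = {}, 0.0
    for t in range(len(xs)):
        a = U @ xs[t] + W @ s[t - 1] + b_s        # eq (1.2)
        s[t] = np.tanh(a)                         # eq (1.1)
        z = V @ s[t] + b_o                        # eq (1.4)
        o[t] = softmax(z)                         # eq (1.3)/(1.5)
        E += -np.log(o[t] @ ys[t])                # eq (1.7): -log o_{y=1}
    return s, o, E

# A toy one-hot input/target sequence
xs = [np.eye(C)[i] for i in (0, 2, 4)]
ys = [np.eye(C)[i] for i in (2, 4, 1)]
s, o, E = forward(xs, ys)
print("total loss E =", E)
```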

Backpropagation

The backpropagation algorithm for RNNs is Backpropagation Through Time (BPTT). Its underlying principle is the same as that of ordinary backpropagation, and it can be split into three steps:

  1. Forward-compute the output of every neuron;
  2. Backward-compute the error term $\delta^{(t)}$ of every neuron;
  3. Compute the gradient of every weight and update the parameters.

We denote the error term of the $j$-th output-layer neuron at time $t$ by $\delta_{oj}^{(t)}$:

$$
\begin{aligned}
\delta^{(t)}_{oj} &= \frac{ \partial E_t}{ \partial z^{(t)}_j } = \sum_{i=1}^C \frac{ \partial E_t}{ \partial o_i^{(t)} } \frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j } \\
&= - \sum_{i=1}^C \frac{ \partial \sum_{k=1}^C y_k^{(t)} \log o_k^{(t)} }{ \partial o_i^{(t)} } \frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j } \\
&= - \sum_{i=1}^C \frac{ y_i^{(t)} }{ o_i^{(t)} } \frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j } &&(2.1)
\end{aligned}
$$
For equation (2.1), when $i = j$ (applying the quotient rule, where $\boldsymbol{D}[\cdot]$ denotes differentiation with respect to $z^{(t)}_j$):

$$
\begin{aligned}
\frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j }
&= \partial \left( \frac{ \exp(z^{(t)}_j) }{ \sum_k \exp(z^{(t)}_k) } \right) \Big/ \partial z^{(t)}_j \\
&= \frac{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr] \boldsymbol{D}\bigl[\exp(z^{(t)}_j)\bigr] - \exp(z^{(t)}_j)\, \boldsymbol{D}\bigl[\sum_k \exp(z^{(t)}_k)\bigr] }{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr]^2 } \\
&= \frac{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr] \exp(z^{(t)}_j) - \exp(z^{(t)}_j)\exp(z^{(t)}_j) }{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr]^2 } \\
&= \frac{ \exp(z^{(t)}_j) }{ \sum_k \exp(z^{(t)}_k) } \cdot \frac{ \sum_k \exp(z^{(t)}_k) - \exp(z^{(t)}_j) }{ \sum_k \exp(z^{(t)}_k) } \\
&= o_j^{(t)} \bigl(1 - o_j^{(t)}\bigr) &&(2.2)
\end{aligned}
$$
For equation (2.1), when $i \ne j$:

$$
\begin{aligned}
\frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j }
&= \partial \left( \frac{ \exp(z^{(t)}_i) }{ \sum_k \exp(z^{(t)}_k) } \right) \Big/ \partial z^{(t)}_j \\
&= \frac{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr] \boldsymbol{D}\bigl[\exp(z^{(t)}_i)\bigr] - \exp(z^{(t)}_i)\, \boldsymbol{D}\bigl[\sum_k \exp(z^{(t)}_k)\bigr] }{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr]^2 } \\
&= \frac{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr] \cdot 0 - \exp(z^{(t)}_i)\exp(z^{(t)}_j) }{ \bigl[\sum_k \exp(z^{(t)}_k)\bigr]^2 } \\
&= -\frac{ \exp(z^{(t)}_i) }{ \sum_k \exp(z^{(t)}_k) } \cdot \frac{ \exp(z^{(t)}_j) }{ \sum_k \exp(z^{(t)}_k) } \\
&= - o_i^{(t)} o_j^{(t)} &&(2.3)
\end{aligned}
$$
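Taken together, (2.2) and (2.3) say that the Jacobian of the softmax is $\mathrm{diag}(o^{(t)}) - o^{(t)}\,(o^{(t)})^T$. The following minimal sketch (toy values, for illustration only) checks that closed form against a central-difference approximation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(0)
z = np.random.randn(5)                         # a toy pre-activation z^{(t)}
o = softmax(z)

# Closed form from (2.2)/(2.3): J[i, j] = do_i/dz_j = o_i (I(i=j) - o_j)
J_analytic = np.diag(o) - np.outer(o, o)

# Numerical Jacobian via central differences
eps = 1e-6
J_numeric = np.zeros((5, 5))
for j in range(5):
    dz = np.zeros(5); dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))  # ~1e-10
```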
Combining these, and since $y^{(t)}$ is one-hot so that $\sum_i y^{(t)}_i = 1$, the error term of the $j$-th output-layer neuron $\delta_{oj}^{(t)}$ is:

$$
\begin{aligned}
\delta^{(t)}_{oj} &= - \sum_{i=1}^C \frac{ y_i^{(t)} }{ o_i^{(t)} } \frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j } \\
&= - \frac{ y_j^{(t)} }{ o_j^{(t)} } \frac{ \partial o_j^{(t)} }{ \partial z^{(t)}_j } - \sum_{i=1, i \ne j}^C \frac{ y_i^{(t)} }{ o_i^{(t)} } \frac{ \partial o_i^{(t)} }{ \partial z^{(t)}_j } \\
&= - \frac{ y_j^{(t)} }{ o_j^{(t)} } o_j^{(t)} \bigl(1 - o_j^{(t)}\bigr) - \sum_{i=1, i \ne j}^C \frac{ y_i^{(t)} }{ o_i^{(t)} } \bigl(- o_i^{(t)} o_j^{(t)}\bigr) \\
&= y_j^{(t)} \bigl( o_j^{(t)} - 1 \bigr) + \sum_{i=1, i \ne j}^C y_i^{(t)} o_j^{(t)} \\
&= \sum_{i=1}^C y_i^{(t)} o_j^{(t)} - y_j^{(t)} = o_j^{(t)} - y_j^{(t)} &&(2.4)
\end{aligned}
$$
Equation (2.4) gives the error term of a single output neuron; the error term of the entire output layer at time $t$, $\delta^{(t)}_o$, is therefore:

$$
\delta^{(t)}_o = o^{(t)} - y^{(t)} \qquad (2.5)
$$
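Result (2.4)/(2.5) is the familiar softmax-plus-cross-entropy gradient. As a small consistency check (a sketch with illustrative toy values), multiplying $\partial E_t / \partial o^{(t)} = -y^{(t)}/o^{(t)}$ by the softmax Jacobian from (2.2)/(2.3), as the chain rule in (2.1) prescribes, does give $o^{(t)} - y^{(t)}$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

C = 5
z = np.random.randn(C)
y = np.eye(C)[2]                        # one-hot target; class index chosen arbitrarily
o = softmax(z)

dE_do = -y / o                          # derivative of -sum_i y_i log o_i w.r.t. o
J = np.diag(o) - np.outer(o, o)         # softmax Jacobian, J[i, j] = do_i/dz_j
delta_o = J.T @ dE_do                   # chain rule (2.1): dE_t/dz_j = sum_i (dE_t/do_i)(do_i/dz_j)

print(np.allclose(delta_o, o - y))      # True, matching (2.4)/(2.5)
```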
Having computed the output-layer error term, we now turn to the error term $\delta_{hj}^{(t)}$ of hidden-layer neuron $j$. Computing the hidden-layer error term is slightly more involved than for the output layer, and it splits into two cases:

  1. At the final time step $T$, the error term of a hidden neuron comes only from the layer above it (the output layer);
  2. At an intermediate time step $t$, the error of a hidden neuron is the sum of the error from the layer above and the error from the hidden-layer neurons at the next time step $t+1$.

Case 1: at the final time step $T$, the hidden-layer error term is computed as:

$$
\begin{aligned}
\delta^{(T)}_{hj} &= \frac{ \partial E_T}{ \partial a^{(T)}_j } = \sum_{i=1}^C \frac{ \partial E_T}{ \partial z_i^{(T)} } \frac{ \partial z_i^{(T)} }{ \partial a^{(T)}_j } \\
&= \sum_{i=1}^C \sum_{k=1}^H \frac{ \partial E_T}{ \partial z_i^{(T)} } \frac{ \partial z_i^{(T)} }{ \partial s^{(T)}_k } \frac{ \partial s_k^{(T)} }{ \partial a^{(T)}_j } &&(2.6)
\end{aligned}
$$
From equation (1.1), $s^{(t)} = \tanh( Ux^{(t)} + Ws^{(t-1)} + b_s) = \tanh(a^{(t)})$, so for any time step $t$:

$$
\frac{ \partial s_k^{(t)} }{ \partial a^{(t)}_j } = \frac{ \partial \tanh(a_k^{(t)}) }{ \partial a_j^{(t)} } =
\begin{cases}
1 - s_j^{(t)2} & \text{if } k = j \\
0 & \text{otherwise}
\end{cases}
= I(k = j)\,\bigl(1 - s^{(t)2}_j\bigr) \qquad (2.7)
$$

where $I(k=j)$ is an indicator function that equals 1 when the condition $k=j$ holds and 0 otherwise.
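A quick sanity check on (2.7): the derivative of $\tanh$ is $1 - \tanh^2$, which can be compared against a finite difference (toy values, for illustration only):

```python
import numpy as np

a = np.random.randn(4)                      # a toy weighted input a^{(t)} with H = 4
s = np.tanh(a)                              # s^{(t)} = tanh(a^{(t)})

eps = 1e-6
numeric = (np.tanh(a + eps) - np.tanh(a - eps)) / (2 * eps)
analytic = 1 - s ** 2                       # the k = j entries of (2.7)

print(np.max(np.abs(numeric - analytic)))   # ~1e-11
```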
From equation (1.4), $z^{(t)} = Vs^{(t)} + b_o$, so for any time step $t$:

$$
\frac{ \partial z_i^{(t)} }{ \partial s^{(t)}_k } = \frac{ \partial \bigl( V_{i \bullet}\, s^{(t)} + b_{oi} \bigr) }{ \partial s^{(t)}_k } = V_{ik} \qquad (2.8)
$$
where $V_{i \bullet}$ denotes the $i$-th row of the matrix $V$. Substituting (2.7) and (2.8) into (2.6) gives:

$$
\begin{aligned}
\delta^{(T)}_{hj} &= \sum_{i=1}^C \sum_{k=1}^H \frac{ \partial E_T}{ \partial z_i^{(T)} } \frac{ \partial z_i^{(T)} }{ \partial s^{(T)}_k } \frac{ \partial s_k^{(T)} }{ \partial a^{(T)}_j } \\
&= \sum_{i=1}^C \frac{ \partial E_T}{ \partial z_i^{(T)} } \frac{ \partial z_i^{(T)} }{ \partial s^{(T)}_j } \frac{ \partial s_j^{(T)} }{ \partial a^{(T)}_j } \\
&= \sum_{i=1}^C \delta^{(T)}_{oi} V_{ij} \bigl(1 - s^{(T)2}_j\bigr) \\
&= \bigl(1 - s^{(T)2}_j\bigr) \sum_{i=1}^C \delta^{(T)}_{oi} V_{ij} \\
&= \bigl(1 - s^{(T)2}_j\bigr) [V_{\bullet j}]^T \delta^{(T)}_o &&(2.9)
\end{aligned}
$$
Case 2: at an intermediate time step $t$, the hidden-layer error is computed as:

$$
\delta^{(t)}_{hj} = \frac{ \partial E_t}{ \partial a^{(t)}_j } + \sum_{l=t+1}^T \sum_{k=1}^H \frac{ \partial E_l}{ \partial a_k^{(t+1)} } \frac{ \partial a_k^{(t+1)} }{ \partial a^{(t)}_j } \qquad (2.10)
$$
Next we compute the influence of the hidden layers at the time steps after $t$ on the current time step:

$$
\begin{aligned}
\sum_{l=t+1}^T \sum_{k=1}^H \frac{ \partial E_l}{ \partial a_k^{(t+1)} } \frac{ \partial a_k^{(t+1)} }{ \partial a^{(t)}_j }
&= \sum_{k=1}^H \delta^{(t+1)}_{hk} \frac{ \partial a^{(t+1)}_k }{ \partial a^{(t)}_j } \\
&= \sum_{k=1}^H \delta^{(t+1)}_{hk} \frac{ \partial a^{(t+1)}_k }{ \partial s^{(t)}_j } \frac{ \partial s^{(t)}_j }{ \partial a^{(t)}_j } \\
&= \sum_{k=1}^H \delta^{(t+1)}_{hk} W_{kj} \bigl(1 - s^{(t)2}_j\bigr) \\
&= [W_{\bullet j}]^T \delta^{(t+1)}_{h} \bigl(1 - s^{(t)2}_j\bigr) &&(2.11)
\end{aligned}
$$
Note that the first equality in (2.11) may not be obvious: $\sum_{l=t+1}^T \frac{ \partial E_l}{ \partial a_k^{(t+1)} } = \delta^{(t+1)}_{hk}$ follows from the definition of the hidden-layer error term, namely the sum of the partial derivatives of each per-step loss $E_l$, from time $t+1$ up to the final time $T$, with respect to the weighted input $a_k^{(t+1)}$. Substituting (2.11) into (2.10) gives:

$$
\delta^{(t)}_{hj} = \bigl(1 - s^{(t)2}_j\bigr) [V_{\bullet j}]^T \delta^{(t)}_o + [W_{\bullet j}]^T \delta^{(t+1)}_{h} \bigl(1 - s^{(t)2}_j\bigr) \qquad (2.12)
$$
Putting both cases together and setting $\delta^{(T+1)}_h = \vec{0}$, the error term of hidden neuron $j$ at any time step $t$ is given by (2.13), or in vector form by (2.14):

$$
\begin{aligned}
\delta^{(t)}_{hj} &= \bigl(1 - s^{(t)2}_j\bigr) [V_{\bullet j}]^T \delta^{(t)}_o + [W_{\bullet j}]^T \delta^{(t+1)}_{h} \bigl(1 - s^{(t)2}_j\bigr) &&(2.13) \\
\delta^{(t)}_{h} &= V^T \delta^{(t)}_o \odot \bigl(1 - s^{(t)2}\bigr) + W^T \delta^{(t+1)}_{h} \odot \bigl(1 - s^{(t)2}\bigr) &&(2.14)
\end{aligned}
$$
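The sketch below (random toy values, purely for illustration) confirms that the per-neuron form (2.13) and the vectorized form (2.14) agree:

```python
import numpy as np

np.random.seed(0)
C, H = 5, 4
V = np.random.randn(C, H)
W = np.random.randn(H, H)
s_t = np.tanh(np.random.randn(H))       # a toy hidden state s^{(t)}
delta_o_t = np.random.randn(C)          # output-layer error delta_o^{(t)}
delta_h_next = np.random.randn(H)       # hidden-layer error delta_h^{(t+1)}

# Vectorized form (2.14)
vec = (V.T @ delta_o_t) * (1 - s_t ** 2) + (W.T @ delta_h_next) * (1 - s_t ** 2)

# Per-neuron form (2.13), one hidden unit j at a time
elem = np.array([
    (1 - s_t[j] ** 2) * (V[:, j] @ delta_o_t)
    + (W[:, j] @ delta_h_next) * (1 - s_t[j] ** 2)
    for j in range(H)
])

print(np.allclose(vec, elem))           # True
```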
After this lengthy computation, the error term of every neuron is now available, and we can use these error terms to compute the gradients. $V$ affects the error only through the output at the current time step, whereas $W$ (and likewise $U$) affects the outputs at the current and all later time steps.

$$
\begin{aligned}
\frac{\partial E_t}{ \partial V_{ji} } &= \frac{\partial E_t}{ \partial z_j^{(t)} } \frac{\partial z_j^{(t)} }{ \partial V_{ji} } = \delta^{(t)}_{oj}\, s_i^{(t)} &&(2.15) \\
\frac{\partial E_t}{ \partial W_{ji} } &= \sum_{l=t}^T \frac{\partial E_l}{ \partial a_j^{(t)} } \frac{\partial a_j^{(t)} }{ \partial W_{ji} } = \delta^{(t)}_{hj}\, s_i^{(t-1)} &&(2.16) \\
\frac{\partial E_t}{ \partial U_{ji} } &= \sum_{l=t}^T \frac{\partial E_l}{ \partial a_j^{(t)} } \frac{\partial a_j^{(t)} }{ \partial U_{ji} } = \delta^{(t)}_{hj}\, x_i^{(t)} &&(2.17) \\
\frac{\partial E_t}{ \partial b_{sj} } &= \sum_{l=t}^T \frac{\partial E_l}{ \partial a_j^{(t)} } \frac{\partial a_j^{(t)} }{ \partial b_{sj} } = \delta^{(t)}_{hj} &&(2.18) \\
\frac{\partial E_t}{ \partial b_{oj} } &= \frac{\partial E_t}{ \partial z_j^{(t)} } \frac{\partial z_j^{(t)} }{ \partial b_{oj} } = \delta^{(t)}_{oj} &&(2.19)
\end{aligned}
$$
Equations (2.15)-(2.19) give the gradients of individual scalar weights at each time step; we now write them in vectorized form:

$$
\begin{aligned}
\frac{\partial E_t}{ \partial V } &=
\begin{bmatrix}
\delta_{o1}^{(t)} s_1^{(t)} & \delta_{o1}^{(t)} s_2^{(t)} & \cdots & \delta_{o1}^{(t)} s_H^{(t)} \\
\delta_{o2}^{(t)} s_1^{(t)} & \delta_{o2}^{(t)} s_2^{(t)} & \cdots & \delta_{o2}^{(t)} s_H^{(t)} \\
\vdots & \vdots & & \vdots \\
\delta_{oC}^{(t)} s_1^{(t)} & \delta_{oC}^{(t)} s_2^{(t)} & \cdots & \delta_{oC}^{(t)} s_H^{(t)}
\end{bmatrix}
= \delta_o^{(t)} \otimes s^{(t)} = \bigl( o^{(t)} - y^{(t)} \bigr) \otimes s^{(t)} &&(2.20) \\
\frac{\partial E_t}{ \partial W } &=
\begin{bmatrix}
\delta_{h1}^{(t)} s_1^{(t-1)} & \delta_{h1}^{(t)} s_2^{(t-1)} & \cdots & \delta_{h1}^{(t)} s_H^{(t-1)} \\
\delta_{h2}^{(t)} s_1^{(t-1)} & \delta_{h2}^{(t)} s_2^{(t-1)} & \cdots & \delta_{h2}^{(t)} s_H^{(t-1)} \\
\vdots & \vdots & & \vdots \\
\delta_{hH}^{(t)} s_1^{(t-1)} & \delta_{hH}^{(t)} s_2^{(t-1)} & \cdots & \delta_{hH}^{(t)} s_H^{(t-1)}
\end{bmatrix}
= \delta_h^{(t)} \otimes s^{(t-1)} \\
&= \Bigl[ V^T \delta^{(t)}_o \odot \bigl(1 - s^{(t)2}\bigr) + W^T \delta^{(t+1)}_{h} \odot \bigl(1 - s^{(t)2}\bigr) \Bigr] \otimes s^{(t-1)} &&(2.21) \\
\frac{\partial E_t}{ \partial U } &=
\begin{bmatrix}
\delta_{h1}^{(t)} x_1^{(t)} & \delta_{h1}^{(t)} x_2^{(t)} & \cdots & \delta_{h1}^{(t)} x_C^{(t)} \\
\delta_{h2}^{(t)} x_1^{(t)} & \delta_{h2}^{(t)} x_2^{(t)} & \cdots & \delta_{h2}^{(t)} x_C^{(t)} \\
\vdots & \vdots & & \vdots \\
\delta_{hH}^{(t)} x_1^{(t)} & \delta_{hH}^{(t)} x_2^{(t)} & \cdots & \delta_{hH}^{(t)} x_C^{(t)}
\end{bmatrix}
= \delta_h^{(t)} \otimes x^{(t)} &&(2.22) \\
\frac{\partial E_t}{ \partial b_s } &= \delta^{(t)}_h &&(2.23) \\
\frac{\partial E_t}{ \partial b_o } &= \delta^{(t)}_o &&(2.24)
\end{aligned}
$$
Here $\otimes$ denotes the outer product and $\odot$ denotes the element-wise (Hadamard) product, i.e. multiplication of corresponding entries. This concludes the derivation of the RNN equations.
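To close, here is an end-to-end sketch that stitches the forward pass (1.1)-(1.7) together with the backward formulas (2.5), (2.14) and (2.20)-(2.24), and checks one entry of $\partial E / \partial W$ against a finite difference. All names, toy sizes, and the sample sequence are illustrative assumptions, not a reference implementation.

```python
import numpy as np

np.random.seed(1)
C, H, T = 5, 4, 3                                   # toy sizes: classes, hidden units, time steps

U = np.random.randn(H, C) * 0.1
W = np.random.randn(H, H) * 0.1
V = np.random.randn(C, H) * 0.1
b_s, b_o = np.zeros(H), np.zeros(C)

xs = [np.eye(C)[i] for i in (0, 2, 4)]              # one-hot inputs (arbitrary toy sequence)
ys = [np.eye(C)[i] for i in (2, 4, 1)]              # one-hot targets

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(U, W, V, b_s, b_o):
    s, o, E = {-1: np.zeros(H)}, {}, 0.0
    for t in range(T):
        s[t] = np.tanh(U @ xs[t] + W @ s[t - 1] + b_s)   # (1.1)
        o[t] = softmax(V @ s[t] + b_o)                   # (1.3)
        E += -np.log(o[t] @ ys[t])                       # (1.7)
    return s, o, E

def bptt(U, W, V, b_s, b_o):
    """Accumulate the gradients of E = sum_t E_t over all time steps."""
    s, o, E = forward(U, W, V, b_s, b_o)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db_s, db_o = np.zeros_like(b_s), np.zeros_like(b_o)
    delta_h_next = np.zeros(H)                           # delta_h^{(T+1)} = 0
    for t in reversed(range(T)):
        delta_o = o[t] - ys[t]                           # (2.5)
        delta_h = (V.T @ delta_o + W.T @ delta_h_next) * (1 - s[t] ** 2)   # (2.14)
        dV += np.outer(delta_o, s[t])                    # (2.20)
        dW += np.outer(delta_h, s[t - 1])                # (2.21)
        dU += np.outer(delta_h, xs[t])                   # (2.22)
        db_s += delta_h                                  # (2.23)
        db_o += delta_o                                  # (2.24)
        delta_h_next = delta_h
    return E, dU, dW, dV, db_s, db_o

E, dU, dW, dV, db_s, db_o = bptt(U, W, V, b_s, b_o)

# Central-difference check on a single entry of W
eps, j, i = 1e-5, 1, 2
Wp, Wm = W.copy(), W.copy()
Wp[j, i] += eps
Wm[j, i] -= eps
numeric = (forward(U, Wp, V, b_s, b_o)[2] - forward(U, Wm, V, b_s, b_o)[2]) / (2 * eps)
print(dW[j, i], numeric)                                 # the two values should agree closely
```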
