Introduction to RNNs
An RNN (Recurrent Neural Network) is a neural network that takes sequences as input. The output at the current time step depends not only on the current input but also on the inputs at earlier time steps, which gives the network a form of memory. RNNs are widely used in video processing, language modeling, image processing, and other areas. Its structure is shown below:
In the figure above, $x^{(t)}$ denotes the input at time $t$; $U$ is the input-to-hidden weight matrix, $V$ is the hidden-to-output weight matrix, and $W$ is the weight matrix connecting the hidden layers of two adjacent time steps. $U$, $V$, and $W$ are shared across all time steps.
Forward Computation
The forward computation of an RNN is fairly straightforward. Before walking through it, let us define some notation. The figure below shows a slightly more detailed RNN diagram.
Assume the input and output layers have size $C=5$ and the hidden layer has size $H=4$. Then:

$$x^{(t)} \in R^{C \times 1}, \quad U \in R^{H \times C}, \quad s^{(t)} \in R^{H \times 1}, \quad V \in R^{C \times H}, \quad W \in R^{H \times H}, \quad o^{(t)} \in R^{C \times 1}, \quad y^{(t)} \in R^{C \times 1}$$
where:
$$
\begin{aligned}
s^{(t)} &= \tanh(Ux^{(t)} + Ws^{(t-1)} + b_s) \hspace{9ex} &&(1.1)\\
a^{(t)} &= Ux^{(t)} + Ws^{(t-1)} + b_s \hspace{9ex} &&(1.2)\\
o^{(t)} &= \mathrm{softmax}(Vs^{(t)} + b_o) = [o^{(t)}_1, o^{(t)}_2, o^{(t)}_3, o^{(t)}_4, o^{(t)}_5] \hspace{9ex} &&(1.3)\\
z^{(t)} &= Vs^{(t)} + b_o = [z^{(t)}_1, z^{(t)}_2, z^{(t)}_3, z^{(t)}_4, z^{(t)}_5] \hspace{9ex} &&(1.4)
\end{aligned}
$$
$$o^{(t)}_i = \frac{\exp(z^{(t)}_i)}{\sum_j \exp(z^{(t)}_j)} \hspace{9ex} (1.5)$$
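As a concrete illustration of Eq. (1.5), here is a minimal NumPy sketch (the function name is illustrative; subtracting the maximum before exponentiating is a standard stability trick that leaves the result unchanged, since softmax is shift-invariant):

```python
import numpy as np

def softmax(z):
    """Eq. (1.5): normalized exponentials of the entries of z."""
    z = z - np.max(z)        # shift for numerical stability; does not change the output
    e = np.exp(z)
    return e / np.sum(e)
```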
Here $o^{(t)}_i$ is the probability that $x^{(t)}$ belongs to class $i$, and $\sum_j o^{(t)}_j = 1$. With these definitions and the forward pass in hand, we can use backpropagation to update the parameters. Before deriving the backward pass we need a loss function: the loss of an RNN is the sum of the losses at every time step, and we want the total loss to be as small as possible. For multi-class classification, a softmax output layer with the negative log-likelihood loss is the usual choice.
$$
\begin{aligned}
E &= \sum_t E_t \hspace{9ex} &&(1.6)\\
E_t &= -\sum_{i=1}^{C} y_i^{(t)} \log o_i^{(t)} = -\log o^{(t)}_{y^{(t)}=1} \hspace{9ex} &&(1.7)
\end{aligned}
$$
where $o^{(t)}_{y^{(t)}=1}$ means: take the entry of $o^{(t)}$ at the index where $y^{(t)}$ has the value 1. Put simply, if the 2nd element of $y^{(t)}$ is 1, then $o^{(t)}_{y^{(t)}=1} = o^{(t)}_2$. Note that $y^{(t)}$ is a one-hot vector: exactly one of its elements is 1 and all the others are 0.
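Putting Eqs. (1.1)-(1.7) together, a minimal sketch of the forward pass and total loss might look as follows. The shapes follow the definitions above; the function names, the one-hot column-vector convention, and the zero initial state are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def forward(xs, U, V, W, b_s, b_o, H):
    """Run Eqs. (1.1)-(1.5) over a sequence xs of column vectors of shape (C, 1).

    Returns the hidden states ss and output distributions os at every time step.
    """
    s_prev = np.zeros((H, 1))                 # s^{(0)}, assumed zero
    ss, os = [], []
    for x in xs:
        a = U @ x + W @ s_prev + b_s          # Eq. (1.2)
        s = np.tanh(a)                        # Eq. (1.1)
        z = V @ s + b_o                       # Eq. (1.4)
        z = z - np.max(z)                     # stable softmax, Eq. (1.5)
        o = np.exp(z) / np.sum(np.exp(z))     # Eq. (1.3)
        ss.append(s); os.append(o)
        s_prev = s
    return ss, os

def total_loss(os, ys):
    """Eqs. (1.6)-(1.7): summed negative log-likelihood over all time steps."""
    return -sum(float(np.log(o[np.argmax(y)])) for o, y in zip(os, ys))
```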
Backpropagation
The backward pass of an RNN uses Backpropagation Through Time (BPTT). Its basic principle is the same as ordinary backpropagation, and it can be split into three steps:
- Forward-compute the output of every neuron;
- Backward-compute the error term $\delta^{(t)}$ of every neuron;
- Compute the gradient of every weight and update the weights.
We write the error term of the $j$-th output-layer neuron at time $t$ as $\delta_{oj}^{(t)}$:
$$
\begin{aligned}
\delta^{(t)}_{oj} &= \frac{\partial E_t}{\partial z^{(t)}_j} = \sum_{i=1}^{C} \frac{\partial E_t}{\partial o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} \\
&= -\sum_{i=1}^{C} \frac{\partial \sum_{k=1}^{C} y_k^{(t)} \log o_k^{(t)}}{\partial o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} \\
&= -\sum_{i=1}^{C} \frac{y_i^{(t)}}{o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} \hspace{9ex} &&(2.1)
\end{aligned}
$$
For Eq. (2.1), when $i = j$:
$$
\begin{aligned}
\frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} &= \frac{\partial}{\partial z^{(t)}_j}\left(\frac{\exp(z^{(t)}_j)}{\sum_k \exp(z^{(t)}_k)}\right) \\
&= \frac{\left[\sum_k \exp(z^{(t)}_k)\right]\frac{\partial \exp(z^{(t)}_j)}{\partial z^{(t)}_j} - \exp(z^{(t)}_j)\frac{\partial \sum_k \exp(z^{(t)}_k)}{\partial z^{(t)}_j}}{\left[\sum_k \exp(z^{(t)}_k)\right]^2} \\
&= \frac{\left[\sum_k \exp(z^{(t)}_k)\right]\exp(z^{(t)}_j) - \exp(z^{(t)}_j)\exp(z^{(t)}_j)}{\left[\sum_k \exp(z^{(t)}_k)\right]^2} \\
&= \frac{\exp(z^{(t)}_j)}{\sum_k \exp(z^{(t)}_k)} \cdot \frac{\sum_k \exp(z^{(t)}_k) - \exp(z^{(t)}_j)}{\sum_k \exp(z^{(t)}_k)} \\
&= o_j^{(t)} (1 - o_j^{(t)}) \hspace{9ex} &&(2.2)
\end{aligned}
$$
For Eq. (2.1), when $i \ne j$:
$$
\begin{aligned}
\frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} &= \frac{\partial}{\partial z^{(t)}_j}\left(\frac{\exp(z^{(t)}_i)}{\sum_k \exp(z^{(t)}_k)}\right) \\
&= \frac{\left[\sum_k \exp(z^{(t)}_k)\right] \cdot 0 - \exp(z^{(t)}_i)\exp(z^{(t)}_j)}{\left[\sum_k \exp(z^{(t)}_k)\right]^2} \\
&= -\frac{\exp(z^{(t)}_i)}{\sum_k \exp(z^{(t)}_k)} \cdot \frac{\exp(z^{(t)}_j)}{\sum_k \exp(z^{(t)}_k)} \\
&= -o_i^{(t)} o_j^{(t)} \hspace{9ex} &&(2.3)
\end{aligned}
$$
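Equations (2.2) and (2.3) together say that the softmax Jacobian is $\mathrm{diag}(o) - o o^T$. As a sanity check, here is a small sketch comparing that analytic Jacobian against centered finite differences (the names and tolerances are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.random.randn(5)
o = softmax(z)
J_analytic = np.diag(o) - np.outer(o, o)    # Eqs. (2.2)/(2.3) in matrix form

eps = 1e-6
J_numeric = np.zeros((5, 5))
for j in range(5):
    dz = np.zeros(5); dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # should be ~1e-10 or smaller
```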
Combining the two cases, and using the fact that $y^{(t)}$ is one-hot so that $\sum_i y^{(t)}_i = 1$, the error term of the $j$-th output-layer neuron $\delta_{oj}^{(t)}$ becomes:
$$
\begin{aligned}
\delta^{(t)}_{oj} &= -\sum_{i=1}^{C} \frac{y_i^{(t)}}{o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} \\
&= -\frac{y_j^{(t)}}{o_j^{(t)}} \frac{\partial o_j^{(t)}}{\partial z^{(t)}_j} - \sum_{i=1, i \ne j}^{C} \frac{y_i^{(t)}}{o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial z^{(t)}_j} \\
&= -\frac{y_j^{(t)}}{o_j^{(t)}} o_j^{(t)} (1 - o_j^{(t)}) - \sum_{i=1, i \ne j}^{C} \frac{y_i^{(t)}}{o_i^{(t)}} (-o_i^{(t)} o_j^{(t)}) \\
&= y_j^{(t)} (o_j^{(t)} - 1) + \sum_{i=1, i \ne j}^{C} y_i^{(t)} o_j^{(t)} \\
&= \sum_{i=1}^{C} y_i^{(t)} o_j^{(t)} - y_j^{(t)} = o_j^{(t)} - y_j^{(t)} \hspace{9ex} &&(2.4)
\end{aligned}
$$
Equation (2.4) gives the error term of a single output neuron; the error term of the whole output layer at time $t$, $\delta^{(t)}_o$, is then:
$$\delta^{(t)}_o = o^{(t)} - y^{(t)} \hspace{9ex} (2.5)$$
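Equation (2.5) is the well-known "softmax plus cross-entropy" shortcut: the gradient with respect to the pre-softmax logits is simply the predicted distribution minus the one-hot target. A quick numeric sketch confirming it for a single time step (all names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[np.argmax(y)])    # Eq. (1.7) at one time step

z = np.random.randn(5)
y = np.eye(5)[2]                                # one-hot target, class 2
delta_o = softmax(z) - y                        # Eq. (2.5)

eps = 1e-6
grad_numeric = np.array([
    (loss(z + eps * np.eye(5)[j], y) - loss(z - eps * np.eye(5)[j], y)) / (2 * eps)
    for j in range(5)
])
print(np.max(np.abs(delta_o - grad_numeric)))   # ~1e-10
```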
Having computed the error term of the output layer, we next consider the error term $\delta_{hj}^{(t)}$ of hidden-layer neuron $j$. It is slightly more involved than the output-layer case and splits into two situations:
- At the final time step $T$, the hidden neuron's error comes only from the layer above it (the output layer);
- At an intermediate time step $t$, the hidden neuron's error is the sum of the error from the layer above and the error from the hidden layer at the next time step $t+1$.
First case: at the final time step $T$, the hidden-layer error term is computed as:
$$
\begin{aligned}
\delta^{(T)}_{hj} &= \frac{\partial E_T}{\partial a^{(T)}_j} = \sum_{i=1}^{C} \frac{\partial E_T}{\partial z_i^{(T)}} \frac{\partial z_i^{(T)}}{\partial a^{(T)}_j} \\
&= \sum_{i=1}^{C} \sum_{k=1}^{H} \frac{\partial E_T}{\partial z_i^{(T)}} \frac{\partial z_i^{(T)}}{\partial s^{(T)}_k} \frac{\partial s_k^{(T)}}{\partial a^{(T)}_j} \hspace{9ex} &&(2.6)
\end{aligned}
$$
From Eq. (1.1), $s^{(t)} = \tanh(Ux^{(t)} + Ws^{(t-1)} + b_s) = \tanh(a^{(t)})$, so for any time $t$:
$$
\frac{\partial s_k^{(t)}}{\partial a^{(t)}_j} = \frac{\partial \tanh(a_k^{(t)})}{\partial a_j^{(t)}} = \begin{cases} 1 - s_j^{(t)2} & \text{if } k = j \\ 0 & \text{otherwise} \end{cases} = I(k=j)\,(1 - s^{(t)2}_j) \hspace{9ex} (2.7)
$$
where $I(k=j)$ is an indicator function that equals 1 when the condition $k=j$ holds and 0 otherwise.
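The identity $\tanh'(a) = 1 - \tanh^2(a)$ used in Eq. (2.7) is easy to verify numerically (a throwaway check, names illustrative):

```python
import numpy as np

a = np.random.randn(4)
eps = 1e-6
numeric = (np.tanh(a + eps) - np.tanh(a - eps)) / (2 * eps)
analytic = 1.0 - np.tanh(a) ** 2            # Eq. (2.7), with s = tanh(a)
print(np.max(np.abs(numeric - analytic)))   # ~1e-10
```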
From Eq. (1.4), $z^{(t)} = Vs^{(t)} + b_o$, so for any time $t$:
$$
\frac{\partial z_i^{(t)}}{\partial s^{(t)}_k} = \frac{\partial \left( V_{i \bullet}\, s^{(t)} + b_{oi} \right)}{\partial s^{(t)}_k} = V_{ik} \hspace{9ex} (2.8)
$$
where $V_{i \bullet}$ denotes the $i$-th row of the matrix $V$. Substituting Eqs. (2.7) and (2.8) into (2.6) gives:
$$
\begin{aligned}
\delta^{(T)}_{hj} &= \sum_{i=1}^{C} \sum_{k=1}^{H} \frac{\partial E_T}{\partial z_i^{(T)}} \frac{\partial z_i^{(T)}}{\partial s^{(T)}_k} \frac{\partial s_k^{(T)}}{\partial a^{(T)}_j} \\
&= \sum_{i=1}^{C} \frac{\partial E_T}{\partial z_i^{(T)}} \frac{\partial z_i^{(T)}}{\partial s^{(T)}_j} \frac{\partial s_j^{(T)}}{\partial a^{(T)}_j} \\
&= \sum_{i=1}^{C} \delta^{(T)}_{oi} V_{ij} (1 - s^{(T)2}_j) \\
&= (1 - s^{(T)2}_j) \sum_{i=1}^{C} \delta^{(T)}_{oi} V_{ij} \\
&= (1 - s^{(T)2}_j) [V_{\bullet j}]^T \delta^{(T)}_o \hspace{9ex} &&(2.9)
\end{aligned}
$$
Second case: at an intermediate time step $t$, the hidden-layer error is computed as:
$$
\delta^{(t)}_{hj} = \frac{\partial E_t}{\partial a^{(t)}_j} + \sum_{l=t+1}^{T} \sum_{k=1}^{H} \frac{\partial E_l}{\partial a_k^{(t+1)}} \frac{\partial a_k^{(t+1)}}{\partial a^{(t)}_j} \hspace{9ex} (2.10)
$$
Next, we compute the influence that the hidden layers at the time steps after $t$ exert on the current time step:
$$
\begin{aligned}
\sum_{l=t+1}^{T} \sum_{k=1}^{H} \frac{\partial E_l}{\partial a_k^{(t+1)}} \frac{\partial a_k^{(t+1)}}{\partial a^{(t)}_j} &= \sum_{k=1}^{H} \delta^{(t+1)}_{hk} \frac{\partial a^{(t+1)}_k}{\partial a^{(t)}_j} \\
&= \sum_{k=1}^{H} \delta^{(t+1)}_{hk} \frac{\partial a^{(t+1)}_k}{\partial s^{(t)}_j} \frac{\partial s^{(t)}_j}{\partial a^{(t)}_j} \\
&= \sum_{k=1}^{H} \delta^{(t+1)}_{hk} W_{kj} (1 - s^{(t)2}_j) \\
&= [W_{\bullet j}]^T \delta^{(t+1)}_{h} (1 - s^{(t)2}_j) \hspace{9ex} &&(2.11)
\end{aligned}
$$
The first equality in (2.11) may not be immediately obvious: $\sum_{l=t+1}^{T} \frac{\partial E_l}{\partial a_k^{(t+1)}} = \delta^{(t+1)}_{hk}$. This follows from the definition of the hidden-layer error term: the sum, over the losses from the final time step $T$ back to the current one, of the partial derivatives with respect to the weighted input $a_k^{(t+1)}$ is by definition $\delta^{(t+1)}_{hk}$. Substituting (2.11) into (2.10) gives:
$$
\delta^{(t)}_{hj} = (1 - s^{(t)2}_j) [V_{\bullet j}]^T \delta^{(t)}_o + [W_{\bullet j}]^T \delta^{(t+1)}_{h} (1 - s^{(t)2}_j) \hspace{9ex} (2.12)
$$
Putting both cases together, and defining $\delta^{(T+1)}_h = \vec{0}$, the error term of hidden-layer neuron $j$ at any time $t$, together with its vectorized form, can be written as:
$$
\begin{aligned}
\delta^{(t)}_{hj} &= (1 - s^{(t)2}_j) [V_{\bullet j}]^T \delta^{(t)}_o + [W_{\bullet j}]^T \delta^{(t+1)}_{h} (1 - s^{(t)2}_j) \hspace{9ex} &&(2.13) \\
\delta^{(t)}_{h} &= V^T \delta^{(t)}_o \odot (1 - s^{(t)2}) + W^T \delta^{(t+1)}_{h} \odot (1 - s^{(t)2}) \hspace{9ex} &&(2.14)
\end{aligned}
$$
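Equation (2.14) is the backward recursion at the heart of BPTT: starting from $\delta_h^{(T+1)} = \vec{0}$, each hidden error term is computed from the output error at the same step and the hidden error one step later. A minimal sketch, reusing the variables from the forward-pass sketch earlier (names illustrative):

```python
import numpy as np

def hidden_deltas(os, ys, ss, V, W):
    """Backward recursion of Eq. (2.14); returns delta_h^{(t)} for t = 1..T."""
    T = len(os)
    H = ss[0].shape[0]
    delta_h = [None] * T
    delta_next = np.zeros((H, 1))             # delta_h^{(T+1)} = 0
    for t in reversed(range(T)):
        delta_o = os[t] - ys[t]               # Eq. (2.5)
        dtanh = 1.0 - ss[t] ** 2              # (1 - s^{(t)2})
        delta_h[t] = (V.T @ delta_o) * dtanh + (W.T @ delta_next) * dtanh
        delta_next = delta_h[t]
    return delta_h
```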
After this long computation, we now have the error term of every neuron, and we can use these error terms to compute the gradients. Note that $V$ influences the loss only through the output at the current time step, whereas $W$ influences the outputs at the current and all later time steps.
$$
\begin{aligned}
\frac{\partial E_t}{\partial V_{ji}} &= \frac{\partial E_t}{\partial z_j^{(t)}} \frac{\partial z_j^{(t)}}{\partial V_{ji}} = \delta^{(t)}_{oj} s_i^{(t)} \hspace{9ex} &&(2.15) \\
\frac{\partial E_t}{\partial W_{ji}} &= \sum_{l=t}^{T} \frac{\partial E_l}{\partial a_j^{(t)}} \frac{\partial a_j^{(t)}}{\partial W_{ji}} = \delta^{(t)}_{hj} s_i^{(t-1)} \hspace{9ex} &&(2.16) \\
\frac{\partial E_t}{\partial U_{ji}} &= \sum_{l=t}^{T} \frac{\partial E_l}{\partial a_j^{(t)}} \frac{\partial a_j^{(t)}}{\partial U_{ji}} = \delta^{(t)}_{hj} x_i^{(t)} \hspace{9ex} &&(2.17) \\
\frac{\partial E_t}{\partial b_{sj}} &= \sum_{l=t}^{T} \frac{\partial E_l}{\partial a_j^{(t)}} \frac{\partial a_j^{(t)}}{\partial b_{sj}} = \delta^{(t)}_{hj} \hspace{9ex} &&(2.18) \\
\frac{\partial E_t}{\partial b_{oj}} &= \frac{\partial E_t}{\partial z_j^{(t)}} \frac{\partial z_j^{(t)}}{\partial b_{oj}} = \delta^{(t)}_{oj} \hspace{9ex} &&(2.19)
\end{aligned}
$$
Equations (2.15)-(2.19) give the gradients with respect to individual weight scalars at each time step; we now vectorize them:
$$
\begin{aligned}
\frac{\partial E_t}{\partial V} &= \begin{bmatrix}
\delta_{o1}^{(t)} s_1^{(t)} & \delta_{o1}^{(t)} s_2^{(t)} & \cdots & \delta_{o1}^{(t)} s_H^{(t)} \\
\delta_{o2}^{(t)} s_1^{(t)} & \delta_{o2}^{(t)} s_2^{(t)} & \cdots & \delta_{o2}^{(t)} s_H^{(t)} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{oC}^{(t)} s_1^{(t)} & \delta_{oC}^{(t)} s_2^{(t)} & \cdots & \delta_{oC}^{(t)} s_H^{(t)}
\end{bmatrix} = \delta_o^{(t)} \otimes s^{(t)} = (o^{(t)} - y^{(t)}) \otimes s^{(t)} \hspace{9ex} &&(2.20) \\
\frac{\partial E_t}{\partial W} &= \begin{bmatrix}
\delta_{h1}^{(t)} s_1^{(t-1)} & \delta_{h1}^{(t)} s_2^{(t-1)} & \cdots & \delta_{h1}^{(t)} s_H^{(t-1)} \\
\delta_{h2}^{(t)} s_1^{(t-1)} & \delta_{h2}^{(t)} s_2^{(t-1)} & \cdots & \delta_{h2}^{(t)} s_H^{(t-1)} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{hH}^{(t)} s_1^{(t-1)} & \delta_{hH}^{(t)} s_2^{(t-1)} & \cdots & \delta_{hH}^{(t)} s_H^{(t-1)}
\end{bmatrix} = \delta_h^{(t)} \otimes s^{(t-1)} \\
&= [V^T \delta^{(t)}_o \odot (1-s^{(t)2}) + W^T \delta^{(t+1)}_{h} \odot (1-s^{(t)2})] \otimes s^{(t-1)} \hspace{9ex} &&(2.21) \\
\frac{\partial E_t}{\partial U} &= \begin{bmatrix}
\delta_{h1}^{(t)} x_1^{(t)} & \delta_{h1}^{(t)} x_2^{(t)} & \cdots & \delta_{h1}^{(t)} x_C^{(t)} \\
\delta_{h2}^{(t)} x_1^{(t)} & \delta_{h2}^{(t)} x_2^{(t)} & \cdots & \delta_{h2}^{(t)} x_C^{(t)} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{hH}^{(t)} x_1^{(t)} & \delta_{hH}^{(t)} x_2^{(t)} & \cdots & \delta_{hH}^{(t)} x_C^{(t)}
\end{bmatrix} = \delta_h^{(t)} \otimes x^{(t)} \hspace{9ex} &&(2.22) \\
\frac{\partial E_t}{\partial b_s} &= \delta^{(t)}_h \hspace{9ex} &&(2.23) \\
\frac{\partial E_t}{\partial b_o} &= \delta^{(t)}_o \hspace{9ex} &&(2.24)
\end{aligned}
$$
where $\otimes$ denotes the outer product and $\odot$ denotes the element-wise (Hadamard) product, i.e. multiplying corresponding entries. This completes the derivation of the RNN equations.
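To tie the whole derivation together, here is a minimal sketch of the complete BPTT gradient computation, Eqs. (2.14) and (2.20)-(2.24), accumulating the per-step gradients over the sequence. It reuses the `forward` and `total_loss` sketches from the forward-computation section; all names are illustrative, not a reference implementation:

```python
import numpy as np

def bptt_gradients(xs, ys, ss, os, U, V, W):
    """Accumulate dE/dU, dE/dV, dE/dW, dE/db_s, dE/db_o over all time steps."""
    H, C = W.shape[0], V.shape[0]
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    db_s, db_o = np.zeros((H, 1)), np.zeros((C, 1))
    delta_next = np.zeros((H, 1))                             # delta_h^{(T+1)} = 0
    for t in reversed(range(len(xs))):
        delta_o = os[t] - ys[t]                               # Eq. (2.5)
        dtanh = 1.0 - ss[t] ** 2
        delta_h = (V.T @ delta_o + W.T @ delta_next) * dtanh  # Eq. (2.14)
        s_prev = ss[t - 1] if t > 0 else np.zeros((H, 1))
        dV += delta_o @ ss[t].T                               # Eq. (2.20), outer product
        dW += delta_h @ s_prev.T                              # Eq. (2.21)
        dU += delta_h @ xs[t].T                               # Eq. (2.22)
        db_s += delta_h                                       # Eq. (2.23)
        db_o += delta_o                                       # Eq. (2.24)
        delta_next = delta_h
    return dU, dV, dW, db_s, db_o
```

A quick way to validate such an implementation is to spot-check an analytic gradient entry against a centered finite difference of the total loss:

```python
np.random.seed(0)
C, H, T = 5, 4, 6
U, V, W = [0.1 * np.random.randn(*s) for s in [(H, C), (C, H), (H, H)]]
b_s, b_o = np.zeros((H, 1)), np.zeros((C, 1))
xs = [np.eye(C)[[np.random.randint(C)]].T for _ in range(T)]   # one-hot inputs (C, 1)
ys = [np.eye(C)[[np.random.randint(C)]].T for _ in range(T)]   # one-hot targets

ss, os = forward(xs, U, V, W, b_s, b_o, H)
dU, dV, dW, db_s, db_o = bptt_gradients(xs, ys, ss, os, U, V, W)

# Numerical gradient for one entry of W as a spot check.
eps = 1e-5
W[0, 1] += eps;     E_plus  = total_loss(forward(xs, U, V, W, b_s, b_o, H)[1], ys)
W[0, 1] -= 2 * eps; E_minus = total_loss(forward(xs, U, V, W, b_s, b_o, H)[1], ys)
W[0, 1] += eps
print(dW[0, 1], (E_plus - E_minus) / (2 * eps))   # the two numbers should agree
```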