1. Network Structure
Fully connected networks and convolutional networks are both feedforward networks: the model's output has no feedback into the model itself. In a recurrent neural network (RNN), by contrast, there is feedback between the output and the model.
This feedback exists precisely because the RNN introduces the notion of a memory (hidden state).
At time $t$, the input is $x^{t}$, the output is $y^{t}$, and the memory is $h^{t}$. Here $h^{t}$ is computed from $x^{t}$ together with the memory $h^{t-1}$ of the previous time step, and $y^{t}$ is computed from the memory $h^{t}$. The network structure at time $t$ is shown below:
Unrolling the recurrent network along the time steps $t$:
During forward propagation, only the memory $h^{t}$ and the output $y^{t}$ are updated; the parameter matrix $\mathbf{W}$ stays fixed.
During backpropagation, gradient descent is used to update the parameter matrix $\mathbf{W}$.
An RNN uses recurrent cells to extract temporal features, and then feeds the extracted features into a fully connected network for prediction.
The number of recurrent cells can be chosen freely, as shown in the figure below:
Because of the memory, an RNN retains historical information, so it is commonly used for sequence data such as speech and text.
2. Forward Propagation
(1) Memory $h^{t}$
The hidden state $h^{t}$ at time $t$ is jointly determined by the input $x^{t}$ at time $t$ and the hidden state $h^{t-1}$ at time $t-1$:
$$h^{t}=\sigma(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})$$
where $\mathbf{U}$ and $\mathbf{W}$ are globally shared weight matrices, $\mathbf{b}$ is a bias vector, and $\sigma$ is the activation function, typically $\tanh$.
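As a concrete illustration, a single hidden-state update takes only a few lines of NumPy. This is a minimal sketch; the sizes `n_x` and `n_h` and the random initialization are assumptions made here for illustration, not part of the original derivation.

```python
import numpy as np

n_x, n_h = 4, 8                              # assumed input and hidden sizes
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n_h, n_x))    # input-to-hidden matrix U
W = 0.1 * rng.standard_normal((n_h, n_h))    # hidden-to-hidden matrix W
b = np.zeros(n_h)                            # bias vector b

x_t = rng.standard_normal(n_x)               # input x^t at time t
h_prev = np.zeros(n_h)                       # previous memory h^{t-1}

# h^t = tanh(U x^t + W h^{t-1} + b)
h_t = np.tanh(U @ x_t + W @ h_prev + b)
```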
(2) Prediction $\widehat{y}^{t}$
$$\widehat{y}^{t}=\sigma(\mathbf{V}h^{t}+\mathbf{c})$$
where $\mathbf{V}$ is a globally shared weight matrix, $\mathbf{c}$ is a bias vector, and $\sigma$ is the activation function, typically softmax.
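Putting the two formulas together, a forward pass over a whole sequence might look like the sketch below. The function name `rnn_forward` and the list-based bookkeeping are choices made here for illustration; the shapes assumed are $\mathbf{U}\in\mathbb{R}^{n_h\times n_x}$, $\mathbf{W}\in\mathbb{R}^{n_h\times n_h}$, and $\mathbf{V}\in\mathbb{R}^{K\times n_h}$.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())                  # subtract max for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, h0):
    """Run the RNN over a sequence xs = [x^1, ..., x^T].

    Returns the hidden states h^t and predictions yhat^t at every step."""
    h = h0
    hs, y_hats = [], []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h + b)     # memory update: h^t
        y_hats.append(softmax(V @ h + c))    # prediction: yhat^t
        hs.append(h)
    return hs, y_hats
```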
3. Backpropagation
(1) Loss Function
The model's overall loss is defined as the sum of the losses at each time step:
$$L=\sum_{t=1}^{T}L^{t}$$
The loss at time $t$ is usually the cross-entropy loss:
$$L^{t}=-(\mathbf{y}^{t})^{T}\log \widehat{\mathbf{y}}^{t}$$
where $\mathbf{y}^{t}$ is the true label, encoded as a one-hot vector, and $\widehat{\mathbf{y}}^{t}$ is the predicted probability vector.
Recall how the prediction $\widehat{\mathbf{y}}^{t}$ is computed:
$$\widehat{y}^{t}=\sigma(\mathbf{V}h^{t}+\mathbf{c})$$
For convenience, introduce an intermediate variable:
$$o^{t}=\mathbf{V}h^{t}+\mathbf{c}$$
Then:
$$\widehat{y}^{t}=\sigma(o^{t})=\mathrm{Softmax}(o^{t})=\frac{e^{o^{t}}}{1_{k}^{T}e^{o^{t}}}$$
In the expression above, $1_{k}$ is the all-ones vector of length $K$, where $K$ is the number of classes, so the denominator $1_{k}^{T}e^{o^{t}}$ is a scalar; the numerator is a vector, so the prediction $\widehat{y}^{t}$ is a vector. Substituting into the loss $L^{t}$:
$$\begin{aligned} L^{t}&=-(\mathbf{y}^{t})^{T}\log \widehat{\mathbf{y}}^{t}\\ &=-(\mathbf{y}^{t})^{T}\ln \frac{e^{o^{t}}}{1_{k}^{T}e^{o^{t}}}\\ &=-(\mathbf{y}^{t})^{T}\ln e^{o^{t}}+(\mathbf{y}^{t})^{T}1_{k}\ln 1_{k}^{T}e^{o^{t}}\\ &=\ln 1_{k}^{T}e^{o^{t}}-(\mathbf{y}^{t})^{T}o^{t} \qquad \big((\mathbf{y}^{t})^{T}1_{k}=1\big) \end{aligned}$$
In this expression, $\ln 1_{k}^{T}e^{o^{t}}$ and $(\mathbf{y}^{t})^{T}o^{t}$ are both scalars, so the final result $L^{t}$ is a scalar.
Therefore, the loss at time $t$ is:
$$L^{t}=\ln 1_{k}^{T}e^{o^{t}}-(\mathbf{y}^{t})^{T}o^{t}$$
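This simplified form can be checked numerically: for any logits $o$ and one-hot label $\mathbf{y}$, $-\mathbf{y}^{T}\log\mathrm{Softmax}(o)$ equals $\ln 1_{k}^{T}e^{o}-\mathbf{y}^{T}o$. A small sketch (the class count $K=5$ and the random logits are illustrative assumptions):

```python
import numpy as np

K = 5                                    # number of classes (illustrative)
rng = np.random.default_rng(0)
o = rng.standard_normal(K)               # logits o^t
y = np.zeros(K); y[2] = 1.0              # one-hot ground truth y^t

y_hat = np.exp(o) / np.exp(o).sum()      # yhat^t = e^o / (1^T e^o)
lhs = -y @ np.log(y_hat)                 # -(y^t)^T log yhat^t
rhs = np.log(np.exp(o).sum()) - y @ o    # ln(1^T e^o) - (y^t)^T o
print(np.isclose(lhs, rhs))              # True
```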
(2) Gradients of $L^{t}$ with respect to $\mathbf{V}$ and $\mathbf{c}$
For the derivative of a scalar with respect to a matrix or vector, we use matrix differentials and trace identities:
$$\begin{aligned} dL^{t}&=tr\big[d\ln 1_{k}^{T}e^{o^{t}}-d(\mathbf{y}^{t})^{T}o^{t}\big] \\ &=tr\Big[\frac{1}{1_{k}^{T}e^{o^{t}}}\,d\big(1_{k}^{T}e^{o^{t}}\big)-(\mathbf{y}^{t})^{T}do^{t}\Big]\\ &=tr\Big[\frac{1_{k}^{T}\big(e^{o^{t}}\odot do^{t}\big)}{1_{k}^{T}e^{o^{t}}}-(\mathbf{y}^{t})^{T}do^{t}\Big]\\ &=tr\Big[\frac{(e^{o^{t}})^{T}}{1_{k}^{T}e^{o^{t}}}\,do^{t}-(\mathbf{y}^{t})^{T}do^{t}\Big]\\ &=tr\Big[\big((\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\,d(\mathbf{V}h^{t}+\mathbf{c})\Big] \qquad \Big(\widehat{\mathbf{y}}^{t}=\frac{e^{o^{t}}}{1_{k}^{T}e^{o^{t}}}\Big)\\ &=tr\Big[h^{t}\big((\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\,d\mathbf{V}+\big((\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\,d\mathbf{c}\Big] \end{aligned}$$
Finally, using the relation between the matrix differential and the derivative, we obtain:
$$\frac{\partial L}{\partial \mathbf{V}}=\sum_{t=1}^{T}\frac{\partial L^{t}}{\partial \mathbf{V}}=\sum_{t=1}^{T}\Big[h^{t}\big((\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\Big]^{T}=\sum_{t=1}^{T}(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})(h^{t})^{T}$$
$$\frac{\partial L}{\partial \mathbf{c}}=\sum_{t=1}^{T}\frac{\partial L^{t}}{\partial \mathbf{c}}=\sum_{t=1}^{T}\big[(\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big]^{T}=\sum_{t=1}^{T}(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})$$
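In code, these two gradients reduce to an outer product and a plain sum over time. A sketch, assuming `hs`, `y_hats`, and `ys` are lists of the per-step hidden states $h^{t}$, predictions $\widehat{\mathbf{y}}^{t}$, and one-hot labels $\mathbf{y}^{t}$ (e.g. as produced by the forward-pass sketch earlier):

```python
import numpy as np

def grads_V_c(hs, y_hats, ys):
    """dL/dV = sum_t (yhat^t - y^t)(h^t)^T,  dL/dc = sum_t (yhat^t - y^t)."""
    dV = sum(np.outer(y_hat - y, h) for h, y_hat, y in zip(hs, y_hats, ys))
    dc = sum(y_hat - y for y_hat, y in zip(y_hats, ys))
    return dV, dc
```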
(3) Gradients of $L^{t}$ with respect to $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$
We know that:
$$h^{t}=\tanh(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})$$
From the relation between the $\tanh$ function and its derivative:
$$(h^{t})'=1-(h^{t})^{2}$$
Differentiating $L^{t}$ (a scalar, with respect to vectors and matrices) using matrix differentials and trace identities:
$$\begin{aligned} dL^{t}&=tr\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}d\tanh(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})\Big] \\ &=tr\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\,d(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})\Big]\\ &=tr\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)(d\mathbf{U})x^{t}+\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)(d\mathbf{W})h^{t-1}+\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)d\mathbf{b}\Big]\\ &=tr\Big[x^{t}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)d\mathbf{U}+h^{t-1}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)d\mathbf{W}+\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)d\mathbf{b}\Big] \end{aligned}$$
Finally, using the relation between the matrix differential and the derivative, we obtain:
$$\frac{\partial L^{t}}{\partial \mathbf{U}}=\Big[x^{t}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\Big]^{T}=diag\big(1-(h^{t})^{2}\big)\frac{\partial L^{t}}{\partial h^{t}}(x^{t})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{W}}=\Big[h^{t-1}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\Big]^{T}=diag\big(1-(h^{t})^{2}\big)\frac{\partial L^{t}}{\partial h^{t}}(h^{t-1})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{b}}=\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\Big]^{T}=diag\big(1-(h^{t})^{2}\big)\frac{\partial L^{t}}{\partial h^{t}}$$
All three gradient formulas contain $\frac{\partial L^{t}}{\partial h^{t}}$. Denote this common term as the error term $\delta^{t}$; to compute the gradients, we must therefore first compute the error term $\delta^{t}$.
(4) Error term $\delta^{t}$ at intermediate time steps
From the way the RNN is computed, the gradient at a sequence position $t$ is determined jointly by the gradient flowing from the output at the current position and the gradient flowing back from position $t+1$:
$$\begin{aligned} dL^{t}&=\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}dh^{t}\\ &=tr\Big[\Big(\frac{\partial L^{t}}{\partial o^{t}}\Big)^{T}do^{t}+\Big(\frac{\partial L^{t+1}}{\partial h^{t+1}}\Big)^{T}dh^{t+1}\Big]\\ &=tr\Big[(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})^{T}d(\mathbf{V}h^{t}+\mathbf{c})+(\delta^{t+1})^{T}d\tanh(\mathbf{U}x^{t+1}+\mathbf{W}h^{t}+\mathbf{b})\Big]\\ &=tr\Big[(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})^{T}\mathbf{V}dh^{t}+(\delta^{t+1})^{T}diag\big(1-(h^{t+1})^{2}\big)d(\mathbf{U}x^{t+1}+\mathbf{W}h^{t}+\mathbf{b})\Big]\\ &=tr\Big[\Big((\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})^{T}\mathbf{V}+(\delta^{t+1})^{T}diag\big(1-(h^{t+1})^{2}\big)\mathbf{W}\Big)dh^{t}\Big] \end{aligned}$$
This yields the error term:
$$\begin{aligned} \delta^{t}&=\frac{\partial L^{t}}{\partial h^{t}}\\ &=\Big[(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})^{T}\mathbf{V}+(\delta^{t+1})^{T}diag\big(1-(h^{t+1})^{2}\big)\mathbf{W}\Big]^{T}\\ &=\mathbf{V}^{T}(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})+\mathbf{W}^{T}diag\big(1-(h^{t+1})^{2}\big)\delta^{t+1} \end{aligned}$$
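This recursion translates directly into a backward loop over time. Since $diag(v)\,u$ is just the elementwise product $v\odot u$, no diagonal matrix needs to be formed. A sketch under the same assumptions as the earlier snippets (the seed $\delta^{T}$ comes from part (5) below; function and variable names are illustrative):

```python
import numpy as np

def backward_deltas(hs, y_hats, ys, V, W):
    """delta^t = V^T(yhat^t - y^t) + W^T diag(1 - (h^{t+1})^2) delta^{t+1}."""
    T = len(hs)
    deltas = [None] * T
    deltas[-1] = V.T @ (y_hats[-1] - ys[-1])     # delta^T, derived in part (5)
    for t in range(T - 2, -1, -1):
        tanh_grad = 1.0 - hs[t + 1] ** 2          # tanh' evaluated at step t+1
        deltas[t] = V.T @ (y_hats[t] - ys[t]) + W.T @ (tanh_grad * deltas[t + 1])
    return deltas
```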
Accordingly, the gradient formulas for $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$ become:
$$\frac{\partial L^{t}}{\partial \mathbf{U}}=\Big[x^{t}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\Big]^{T}=diag\big(1-(h^{t})^{2}\big)\delta^{t}(x^{t})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{W}}=\Big[h^{t-1}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\Big]^{T}=diag\big(1-(h^{t})^{2}\big)\delta^{t}(h^{t-1})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{b}}=\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}diag\big(1-(h^{t})^{2}\big)\Big]^{T}=diag\big(1-(h^{t})^{2}\big)\delta^{t}$$
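With the error terms in hand, the per-step gradients sum over time just as they did for $\mathbf{V}$ and $\mathbf{c}$. A sketch, assuming `h0` is the initial hidden state $h^{0}$ and `xs`, `hs`, `deltas` are per-step lists (names are illustrative):

```python
import numpy as np

def grads_U_W_b(xs, hs, deltas, h0):
    """Accumulate dL/dU, dL/dW, dL/db from the per-step formulas above."""
    dU = sum(np.outer((1 - h**2) * d, x)                  # diag(1-(h^t)^2) delta^t (x^t)^T
             for x, h, d in zip(xs, hs, deltas))
    h_prevs = [h0] + hs[:-1]                              # h^{t-1} for each step t
    dW = sum(np.outer((1 - h**2) * d, hp)                 # diag(1-(h^t)^2) delta^t (h^{t-1})^T
             for h, d, hp in zip(hs, deltas, h_prevs))
    db = sum((1 - h**2) * d for h, d in zip(hs, deltas))  # diag(1-(h^t)^2) delta^t
    return dU, dW, db
```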
(5) Error term $\delta^{T}$ at the final time step
The error term $\delta^{t}$ of an intermediate step can be computed from the error term $\delta^{t+1}$ of the following step. The final step $T$ has no step $T+1$, so $\delta^{T}$ depends only on the output $o^{T}$ of the current step:
$$\delta^{T}=\mathbf{V}^{T}(\widehat{\mathbf{y}}^{T}-\mathbf{y}^{T})$$
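Tying the pieces together, the whole derivation can be sanity-checked end to end against finite differences. The sketch below reuses the illustrative helpers defined above (`rnn_forward`, `backward_deltas`, `grads_U_W_b`); all sizes and random data are assumptions made for the check, and the loss is the summed cross-entropy $L=\sum_{t}L^{t}$.

```python
import numpy as np

# Reuses rnn_forward, backward_deltas, grads_U_W_b from the sketches above.
rng = np.random.default_rng(1)
n_x, n_h, K, T = 3, 5, 4, 6                    # illustrative sizes
U = 0.2 * rng.standard_normal((n_h, n_x))
W = 0.2 * rng.standard_normal((n_h, n_h))
V = 0.2 * rng.standard_normal((K, n_h))
b, c, h0 = np.zeros(n_h), np.zeros(K), np.zeros(n_h)
xs = [rng.standard_normal(n_x) for _ in range(T)]
ys = [np.eye(K)[rng.integers(K)] for _ in range(T)]   # one-hot labels y^t

def loss(b_):
    """L = sum_t -(y^t)^T log yhat^t, as a function of the bias b."""
    _, y_hats = rnn_forward(xs, U, W, V, b_, c, h0)
    return -sum(y @ np.log(y_hat) for y, y_hat in zip(ys, y_hats))

hs, y_hats = rnn_forward(xs, U, W, V, b, c, h0)
deltas = backward_deltas(hs, y_hats, ys, V, W)
_, _, db = grads_U_W_b(xs, hs, deltas, h0)

# Compare the analytic db with a central finite difference on one coordinate
eps, i = 1e-6, 0
e = np.zeros(n_h); e[i] = eps
numeric = (loss(b + e) - loss(b - e)) / (2 * eps)
print(np.isclose(db[i], numeric))              # True if the derivation holds
```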