42. Recurrent Neural Network (RNN) Principles

1. Network Structure

Fully connected networks and convolutional networks are both feedforward networks: the model's output has no feedback into the model itself. In a recurrent neural network (RNN), by contrast, there is feedback between the output and the model.
This feedback exists because the RNN introduces the concept of a memory cell. At time $t$, the input is $x^{t}$, the output is $y^{t}$, and the memory cell is $h^{t}$. Here, $h^{t}$ is computed from $x^{t}$ and the previous memory cell $h^{t-1}$, and $y^{t}$ is computed from the memory cell $h^{t}$.
The network structure at time $t$ is shown below:
[Figure: RNN cell structure at time $t$]
Unrolling the recurrent network along the time steps $t$:
[Figure: RNN unrolled over time steps]
During forward propagation, the memory cell $h^{t}$ and the output $y^{t}$ are updated step by step, while the parameter matrix $\mathbf{W}$ stays fixed.
During backpropagation, gradient descent is used to update the parameter matrix $\mathbf{W}$.
An RNN uses recurrent kernels to extract temporal features, then feeds the extracted features into a fully connected network for prediction.
The number of recurrent kernels can be chosen freely, as shown below:
[Figure: RNNs with different numbers of recurrent kernels]
Thanks to the memory cell, an RNN retains historical information, which makes it well suited to sequence data such as speech and text.

2. Forward Propagation

(1) Memory cell $h^{t}$

The hidden state $h^{t}$ at time $t$ is determined jointly by the input $x^{t}$ at time $t$ and the hidden state $h^{t-1}$ at time $t-1$:
$$h^{t}=\sigma(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})$$
where $\mathbf{U}$ and $\mathbf{W}$ are globally shared weight matrices, $\mathbf{b}$ is a bias vector, and $\sigma$ is an activation function, usually tanh.

(2) Predicted value $\widehat{y}^{t}$

$$\widehat{y}^{t}=\sigma(\mathbf{V}h^{t}+\mathbf{c})$$
where $\mathbf{V}$ is a globally shared weight matrix, $\mathbf{c}$ is a bias vector, and $\sigma$ is an activation function, usually softmax.
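To make these two formulas concrete, here is a minimal numpy sketch of the forward pass. The dimensions, initialization, and variable names are illustrative assumptions, not part of the original derivation:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())              # shift by max for numerical stability
    return e / e.sum()

# Illustrative sizes: input dim n, hidden dim m, K classes, T time steps
n, m, K, T = 4, 8, 3, 5
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(m, n))        # input-to-hidden weights (shared)
W = 0.1 * rng.normal(size=(m, m))        # hidden-to-hidden weights (shared)
V = 0.1 * rng.normal(size=(K, m))        # hidden-to-output weights (shared)
b, c = np.zeros(m), np.zeros(K)

xs = [rng.normal(size=n) for _ in range(T)]   # a toy input sequence
h = np.zeros(m)                               # initial memory h^0
for x in xs:
    h = np.tanh(U @ x + W @ h + b)       # h^t = tanh(U x^t + W h^{t-1} + b)
    y_hat = softmax(V @ h + c)           # y_hat^t = softmax(V h^t + c)
    print(y_hat)                         # one probability vector per step
```

Note that the same `U`, `W`, `V` are reused at every step of the loop; only the memory `h` changes, which is exactly the weight sharing described above.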

3. Backpropagation

(1) Loss function

The overall loss of the model is defined as the sum of the losses over all time steps:
$$L=\sum_{t=1}^{T}L^{t}$$
The loss at time $t$ is usually the cross-entropy loss:
$$L^{t}=-(\mathbf{y}^{t})^{T}\log \widehat{\mathbf{y}}^{t}$$
where $\mathbf{y}^{t}$ is the one-hot ground-truth vector and $\widehat{\mathbf{y}}^{t}$ is the predicted probability vector (the softmax output).
Recall how the prediction $\widehat{\mathbf{y}}^{t}$ is computed:
$$\widehat{y}^{t}=\sigma(\mathbf{V}h^{t}+\mathbf{c})$$
For convenience, introduce an intermediate variable:
$$o^{t}=\mathbf{V}h^{t}+\mathbf{c}$$
Then:
$$\widehat{y}^{t}=\sigma(o^{t})=\mathrm{Softmax}(o^{t})=\frac{e^{o^{t}}}{1_{k}^{T}e^{o^{t}}}$$
Here $1_{k}$ is the all-ones vector of length $K$ (the number of classes), so the denominator is a scalar; the numerator is a vector, hence the prediction $\widehat{y}^{t}$ is a vector. Substituting into the loss function $L^{t}$:
$$\begin{aligned} L^{t}&=-(\mathbf{y}^{t})^{T}\log \widehat{\mathbf{y}}^{t}\\ &= -(\mathbf{y}^{t})^{T}\ln \frac{e^{o^{t}}}{1_{k}^{T}e^{o^{t}}}\\ &= -(\mathbf{y}^{t})^{T}\ln e^{o^{t}}+(\mathbf{y}^{t})^{T}1_{k}\ln 1_{k}^{T}e^{o^{t}}\\ &= \ln 1_{k}^{T}e^{o^{t}}-(\mathbf{y}^{t})^{T}o^{t} \qquad \big((\mathbf{y}^{t})^{T}1_{k}=1\big) \end{aligned}$$
In the last line, both $\ln 1_{k}^{T}e^{o^{t}}$ and $(\mathbf{y}^{t})^{T}o^{t}$ are scalars, so the result $L^{t}$ is a scalar.
Therefore, the loss at time $t$ is:
$$L^{t}= \ln 1_{k}^{T}e^{o^{t}}-(\mathbf{y}^{t})^{T}o^{t}$$
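A quick numerical sanity check of this identity, as a sketch with made-up logits and a made-up one-hot target (the values are arbitrary):

```python
import numpy as np

o = np.array([1.0, -0.5, 2.0])   # logits o^t = V h^t + c (made-up values)
y = np.array([0.0, 0.0, 1.0])    # one-hot ground truth y^t

# Direct cross entropy: -(y^t)^T log(softmax(o^t))
y_hat = np.exp(o) / np.exp(o).sum()
loss_direct = -y @ np.log(y_hat)

# Simplified form derived above: ln(1_k^T e^{o^t}) - (y^t)^T o^t
loss_simplified = np.log(np.exp(o).sum()) - y @ o

print(loss_direct, loss_simplified)               # the two agree
assert np.isclose(loss_direct, loss_simplified)
```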

(2) Gradients of $L^{t}$ with respect to $\mathbf{V}$ and $\mathbf{c}$

For the derivative of a scalar with respect to a matrix or vector, we use matrix differentials and trace identities (note that $d\,e^{o}=e^{o}\odot do$ and $1_{k}^{T}(a\odot b)=a^{T}b$):
$$\begin{aligned} dL^{t}&=tr\big[d\ln 1_{k}^{T}e^{o^{t}}-d\,(\mathbf{y}^{t})^{T}o^{t}\big] \\ &= tr\Big[\frac{1}{1_{k}^{T}e^{o^{t}}}\,d\big(1_{k}^{T}e^{o^{t}}\big)-(\mathbf{y}^{t})^{T}do^{t}\Big]\\ &= tr\Big[\frac{1_{k}^{T}\big(e^{o^{t}}\odot do^{t}\big)}{1_{k}^{T}e^{o^{t}}}-(\mathbf{y}^{t})^{T}do^{t}\Big]\\ &= tr\Big[\frac{(e^{o^{t}})^{T}}{1_{k}^{T}e^{o^{t}}}\,do^{t}-(\mathbf{y}^{t})^{T}do^{t}\Big]\\ &= tr\Big[\big((\widehat{y}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\,d(\mathbf{V}h^{t}+\mathbf{c})\Big] \qquad \Big(\widehat{y}^{t}=\frac{e^{o^{t}}}{1_{k}^{T}e^{o^{t}}}\Big)\\ &= tr\Big[h^{t}\big((\widehat{y}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\,d\mathbf{V}+\big((\widehat{y}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\,d\mathbf{c}\Big] \end{aligned}$$
Finally, from the relation between the matrix differential and the derivative:
$$\frac{\partial L}{\partial \mathbf{V}}=\sum_{t=1}^{T}\frac{\partial L^{t}}{\partial \mathbf{V}}=\sum_{t=1}^{T}\big[h^{t}\big((\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big)\big]^{T}=\sum_{t=1}^{T}(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})(h^{t})^{T}$$
$$\frac{\partial L}{\partial \mathbf{c}}=\sum_{t=1}^{T}\frac{\partial L^{t}}{\partial \mathbf{c}}=\sum_{t=1}^{T}\big[(\widehat{\mathbf{y}}^{t})^{T}-(\mathbf{y}^{t})^{T}\big]^{T}=\sum_{t=1}^{T}(\widehat{\mathbf{y}}^{t}-\mathbf{y}^{t})$$
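These closed forms are easy to verify with a finite-difference check; a sketch for a single time step, with arbitrary shapes and values (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, K = 4, 3                                   # illustrative sizes
V, c = rng.normal(size=(K, m)), rng.normal(size=K)
h = rng.normal(size=m)                        # stand-in hidden state h^t
y = np.eye(K)[0]                              # one-hot target y^t

def loss(V, c):
    o = V @ h + c
    return np.log(np.exp(o).sum()) - y @ o    # L^t derived above

o = V @ h + c
y_hat = np.exp(o) / np.exp(o).sum()
grad_V = np.outer(y_hat - y, h)               # (y_hat^t - y^t)(h^t)^T
grad_c = y_hat - y                            # y_hat^t - y^t

# Compare one entry of grad_V against a finite difference
eps = 1e-6
V2 = V.copy(); V2[1, 2] += eps
print(grad_V[1, 2], (loss(V2, c) - loss(V, c)) / eps)   # nearly equal
```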

(3) Gradients of $L^{t}$ with respect to $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$

We already have:
$$h^{t}=\tanh(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})$$
From the relation between the $\tanh$ function and its derivative ($\tanh' z=1-\tanh^{2} z$):
$$(h^{t})'=1-(h^{t})^{2}$$
L t L^{t} Lt求导,标量对向量求导,使用矩阵微分和迹函数公式:
d L t = t r [ ( ∂ L t ∂ h t ) T d tanh ⁡ ( U x t + W h t − 1 + b ) ] = t r [ ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d ( U x t + W h t − 1 + b ) ] = t r [ ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d ( U ) x t + ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d ( W ) h t − 1 + ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d b ] = t r [ x t ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d U + h t − 1 ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d W + ( ∂ L t ∂ h t ) T d i a g ( 1 − ( h t ) 2 ) d b ] \begin{aligned} dL^{t}&=tr[(\frac{\partial L^{t}}{\partial h^{t}})^{T}d\tanh(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})] \\ &= tr[(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d(\mathbf{U}x^{t}+\mathbf{W}h^{t-1}+\mathbf{b})]\\ &= tr[(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d(\mathbf{U})x^{t}+(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d(\mathbf{W})h^{t-1}+(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d\mathbf{b}]\\ &= tr[x^{t}(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d\mathbf{U}+h^{t-1}(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d\mathbf{W}+(\frac{\partial L^{t}}{\partial h^{t}})^{T}diag(1-(h^{t})^{2})d\mathbf{b}] \end{aligned} dLt=tr[(htLt)Tdtanh(Uxt+Wht1+b)]=tr[(htLt)Tdiag(1(ht)2)d(Uxt+Wht1+b)]=tr[(htLt)Tdiag(1(ht)2)d(U)xt+(htLt)Tdiag(1(ht)2)d(W)ht1+(htLt)Tdiag(1(ht)2)db]=tr[xt(htLt)Tdiag(1(ht)2)dU+ht1(htLt)Tdiag(1(ht)2)dW+(htLt)Tdiag(1(ht)2)db]
Finally, from the relation between the matrix differential and the derivative:
$$\frac{\partial L^{t}}{\partial \mathbf{U}}=\Big[x^{t}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}\mathrm{diag}(1-(h^{t})^{2})\Big]^{T}=\mathrm{diag}(1-(h^{t})^{2})\,\frac{\partial L^{t}}{\partial h^{t}}\,(x^{t})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{W}}=\Big[h^{t-1}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}\mathrm{diag}(1-(h^{t})^{2})\Big]^{T}=\mathrm{diag}(1-(h^{t})^{2})\,\frac{\partial L^{t}}{\partial h^{t}}\,(h^{t-1})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{b}}=\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}\mathrm{diag}(1-(h^{t})^{2})\Big]^{T}=\mathrm{diag}(1-(h^{t})^{2})\,\frac{\partial L^{t}}{\partial h^{t}}$$
All three gradient formulas share the factor $\frac{\partial L^{t}}{\partial h^{t}}$. Denote this common term as the error term $\delta^{t}$; to compute the gradients we must first compute $\delta^{t}$.
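Assuming $\delta^{t}$ is already available (its recursion is derived in the next subsection), each per-step gradient reduces to an elementwise scaling plus an outer product. A minimal sketch with illustrative names:

```python
import numpy as np

def step_grads(delta, h_t, h_prev, x_t):
    """Per-step gradients of L^t w.r.t. U, W, b, given delta = dL^t/dh^t.

    Multiplying by diag(1 - (h^t)^2) is just an elementwise product."""
    d = (1.0 - h_t ** 2) * delta
    grad_U = np.outer(d, x_t)       # diag(1-(h^t)^2) delta^t (x^t)^T
    grad_W = np.outer(d, h_prev)    # diag(1-(h^t)^2) delta^t (h^{t-1})^T
    grad_b = d                      # diag(1-(h^t)^2) delta^t
    return grad_U, grad_W, grad_b
```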

(4) Error term $\delta^{t}$ at intermediate steps

From the structure of the RNN, the gradient at sequence position $t$ has two sources: the gradient from the output at the current position, and the gradient propagated back from position $t+1$:
$$\begin{aligned} dL^{t}&=\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}dh^{t}\\ &=tr\Big[\Big(\frac{\partial L^{t}}{\partial o^{t}}\Big)^{T} do^{t}+\Big(\frac{\partial L^{t+1}}{\partial h^{t+1}}\Big)^{T} dh^{t+1}\Big]\\ &= tr\big[(\widehat{y}^{t}-y^{t})^{T}d(\mathbf{V}h^{t}+\mathbf{c})+(\delta^{t+1})^{T}d\tanh(\mathbf{U}x^{t+1}+\mathbf{W}h^{t}+\mathbf{b})\big]\\ &=tr\big[(\widehat{y}^{t}-y^{t})^{T}\mathbf{V}dh^{t}+(\delta^{t+1})^{T}\mathrm{diag}(1-(h^{t+1})^{2})\,d(\mathbf{U}x^{t+1}+\mathbf{W}h^{t}+\mathbf{b})\big]\\ &=tr\Big[\big((\widehat{y}^{t}-y^{t})^{T}\mathbf{V}+(\delta^{t+1})^{T}\mathrm{diag}(1-(h^{t+1})^{2})\mathbf{W}\big)\,dh^{t}\Big] \end{aligned}$$
This yields the error term:
$$\begin{aligned} \delta^{t}&=\frac{\partial L^{t}}{\partial h^{t}}\\ &=\big[(\widehat{y}^{t}-y^{t})^{T}\mathbf{V}+(\delta^{t+1})^{T}\mathrm{diag}(1-(h^{t+1})^{2})\mathbf{W}\big]^{T}\\ &= \mathbf{V}^{T}(\widehat{y}^{t}-y^{t})+\mathbf{W}^{T}\mathrm{diag}(1-(h^{t+1})^{2})\,\delta^{t+1} \end{aligned}$$
Therefore, the gradient formulas for $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$ become:
$$\frac{\partial L^{t}}{\partial \mathbf{U}}=\Big[x^{t}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}\mathrm{diag}(1-(h^{t})^{2})\Big]^{T}=\mathrm{diag}(1-(h^{t})^{2})\,\delta^{t}(x^{t})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{W}}=\Big[h^{t-1}\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}\mathrm{diag}(1-(h^{t})^{2})\Big]^{T}=\mathrm{diag}(1-(h^{t})^{2})\,\delta^{t}(h^{t-1})^{T}$$
$$\frac{\partial L^{t}}{\partial \mathbf{b}}=\Big[\Big(\frac{\partial L^{t}}{\partial h^{t}}\Big)^{T}\mathrm{diag}(1-(h^{t})^{2})\Big]^{T}=\mathrm{diag}(1-(h^{t})^{2})\,\delta^{t}$$
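In code, one step of the recursion just derived is a single line; combined with the `step_grads` sketch above it yields all per-step parameter gradients. Names are illustrative, and `dy` stands for $\widehat{y}^{t}-y^{t}$:

```python
import numpy as np

def delta_step(dy, delta_next, h_next, V, W):
    """One step of the backward recursion:
    delta^t = V^T (y_hat^t - y^t) + W^T diag(1 - (h^{t+1})^2) delta^{t+1}"""
    return V.T @ dy + W.T @ ((1.0 - h_next ** 2) * delta_next)
```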

(5) Error term $\delta^{T}$ at the final step

The error term $\delta^{t}$ at an intermediate step is computed from the error term $\delta^{t+1}$ of the following step. The final step $T$ has no step $T+1$ after it, so $\delta^{T}$ depends only on the current output $o^{T}$:
$$\delta^{T}= \mathbf{V}^{T}(\widehat{y}^{T}-y^{T})$$
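Putting everything together, backpropagation through time starts from $\delta^{T}$ and runs the recursion down to $t=1$, accumulating all five gradients along the way. A minimal sketch under the same illustrative conventions as before (indexing assumption: `xs[t-1]` is $x^{t}$, `ys[t-1]` is $y^{t}$, `y_hats[t-1]` is $\widehat{y}^{t}$, and `hs[t]` is $h^{t}$ with `hs[0]` $=h^{0}$):

```python
import numpy as np

def bptt(xs, ys, hs, y_hats, U, W, V):
    """Gradients of L = sum_t L^t via backpropagation through time."""
    T = len(xs)
    gU, gW, gV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    gb, gc = np.zeros(U.shape[0]), np.zeros(V.shape[0])
    delta_next = None
    for t in range(T, 0, -1):
        dy = y_hats[t - 1] - ys[t - 1]          # y_hat^t - y^t
        gV += np.outer(dy, hs[t])               # (y_hat^t - y^t)(h^t)^T
        gc += dy
        if t == T:
            delta = V.T @ dy                    # delta^T: no future term
        else:                                   # recursion through step t+1
            delta = V.T @ dy + W.T @ ((1 - hs[t + 1] ** 2) * delta_next)
        d = (1 - hs[t] ** 2) * delta            # diag(1-(h^t)^2) delta^t
        gU += np.outer(d, xs[t - 1])
        gW += np.outer(d, hs[t - 1])
        gb += d
        delta_next = delta
    return gU, gW, gV, gb, gc
```

A gradient-descent update then subtracts a learning rate times each returned gradient from the corresponding parameter.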
