Week 7 [Task 1] RNN Concepts & Forward Propagation

Recurrent neural networks borrow ideas such as weight sharing from convolutional neural networks in order to handle sequential data, which is why the two architectures have a lot in common.

The main difference between an RNN and a CNN lies in the form of the input:

Recurrent neural networks are a family of networks for processing sequential data, while convolutional neural networks are a family of networks for processing grid-structured data (such as an image).

Recurrent networks scale to much longer sequences, and most of them can also handle sequences of variable length; convolutional networks scale easily to images of large width and height and can handle images of variable size.

Recurrent Graphs and Unrolled Graphs

The unrolled graph makes the computation flow explicit. By showing the paths along which information moves, it also illustrates the idea of information flowing forward in time (to compute the outputs and the loss) and backward in time (to compute the gradients).

[Figure: a recurrent graph and its unrolled form. Source: 花书 (Deep Learning, Goodfellow et al.), page 321]

The Simplest Form of an RNN

[Figure: the simplest RNN, drawn as a recurrent graph (left) and unrolled over time (right)]

In the recurrent graph on the left, x is the network input, U is the weight matrix from the input layer to the hidden layer, W is the weight matrix from the memory cell to the hidden layer, V is the weight matrix from the hidden layer to the output layer, S is the hidden layer's output, which is also stored in the memory cell and fed back in together with the input at the next time step, and O is the network output. The matrices W, V, and U play the same role as the parameter matrices of a fully connected network.

From the unrolled graph we can see that the hidden-layer output at each time step is passed on to the next time step, so the network at every time step retains some history from earlier time steps, combines it with the current state, and passes it on again.

Forward Propagation

[Figure: forward propagation through the unrolled RNN]

Suppose we have a 1000 ms speech sample whose content is "早上好" (good morning). Taking one sample vector every 10 ms gives 100 sample vectors, and resampling each vector to 160 dimensions gives a $100 \times 160$ matrix, i.e. $t = 100$. For example, $t = 1$ to $30$ corresponds to "zao", $t = 31$ to $70$ to "shang", and $t = 71$ to $100$ to "hao". Suppose further that the label vocabulary contains 6000 words, so the output vector $O$ has dimension $6000 \times 1$. The figure above shows the forward propagation of these sample vectors through the RNN.

Assume the hidden layer has 1000 neurons, so $h$ is 1000-dimensional and $U$ has dimension $160 \times 1000$. For $t = 1, 2$ we have
$$\begin{array}{ll} h_{1}=x_{1} U+b_{1} & h_{2}=x_{2} U+S_{1} W+b_{1} \\ S_{1}=f\left(h_{1}\right) & S_{2}=f\left(h_{2}\right) \\ O_{1}=S_{1} V+b_{2} & O_{2}=S_{2} V+b_{2} \end{array}$$
For the last two time steps $t-1$ and $t$ we have
$$\begin{array}{ll} h_{t-1}=x_{t-1} U+S_{t-2} W+b_{1} & h_{t}=x_{t} U+S_{t-1} W+b_{1} \\ S_{t-1}=f\left(h_{t-1}\right) & S_{t}=f\left(h_{t}\right) \\ O_{t-1}=S_{t-1} V+b_{2} & O_{t}=S_{t} V+b_{2} \end{array}$$
Here $x_i$ is a 160-dimensional vector, $S_i$ and $h_i$ are 1000-dimensional vectors, and $O_i$ is a 6000-dimensional vector. As we can see, the RNN is trained by sharing the three parameter matrices $W, U, V$ across time steps. Without weight sharing, every time step would need its own matrices, and the parameter count would grow with the sequence length; with 100 time steps, sharing cuts the parameter count by a factor of 100. Another benefit is that the same network can handle sequence sets of different lengths.
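To make the shapes concrete, here is a minimal NumPy sketch of this forward pass. The variable names (`x`, `U`, `W`, `V`, `b1`, `b2`) simply mirror the symbols above and are chosen here for illustration, not taken from any library.

```python
import numpy as np

# Shapes from the example above: t = 100 time steps, 160-dim input frames,
# 1000 hidden units, 6000 output classes.
T, d_in, d_h, d_out = 100, 160, 1000, 6000

rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_in))          # the input sequence, one row per time step

# The shared parameters U, W, V and biases b1, b2 (the same at every time step)
U = rng.standard_normal((d_in, d_h)) * 0.01
W = rng.standard_normal((d_h, d_h)) * 0.01
V = rng.standard_normal((d_h, d_out)) * 0.01
b1 = np.zeros(d_h)
b2 = np.zeros(d_out)

def f(z):
    return np.tanh(z)                       # activation (a sigmoid would also work)

h = np.zeros((T, d_h))                      # pre-activations h_t
S = np.zeros((T, d_h))                      # hidden states S_t
O = np.zeros((T, d_out))                    # outputs O_t
S_prev = np.zeros(d_h)                      # taking S_0 = 0 reproduces h_1 = x_1 U + b_1

for t in range(T):
    h[t] = x[t] @ U + S_prev @ W + b1       # h_t = x_t U + S_{t-1} W + b_1
    S[t] = f(h[t])                          # S_t = f(h_t)
    O[t] = S[t] @ V + b2                    # O_t = S_t V + b_2
    S_prev = S[t]
```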

Backward Propagation

To obtain the loss function, we sum the discrepancies between every output prediction and its label:
$$J=\sum_{i=1}^{t}\left\|O_{i}-\widetilde{O}_{i}\right\|=J_{1}+J_{2}+\cdots+J_{t}$$
We therefore differentiate with respect to each output node separately:
$$\frac{\partial J}{\partial O_{i}}=\frac{\partial\left(J_{1}+J_{2}+\cdots+J_{t}\right)}{\partial O_{i}}=\frac{\partial J_{i}}{\partial O_{i}}$$
Each of these output gradients has dimension $6000 \times 1$.

Recall that in a fully connected (FC) network, for a hidden layer $y = XW$ we have
$$\frac{\partial J}{\partial X}=\frac{\partial J}{\partial y}W^{T}, \qquad \frac{\partial J}{\partial W}=X^{T}\frac{\partial J}{\partial y}$$
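As a quick sanity check of these two formulas, a tiny NumPy sketch (all shapes here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))        # a small batch of inputs
W = rng.standard_normal((3, 5))
y = X @ W                              # forward: y = XW

dJ_dy = rng.standard_normal(y.shape)   # some upstream gradient dJ/dy
dJ_dX = dJ_dy @ W.T                    # dJ/dX = (dJ/dy) W^T, shape (4, 3)
dJ_dW = X.T @ dJ_dy                    # dJ/dW = X^T (dJ/dy), shape (3, 5)
```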
The RNN case is similar: we first differentiate with respect to the inputs and outputs of each hidden layer. For the last two time steps we have
$$\begin{array}{ll} \dfrac{\partial J}{\partial S_{t}}=\dfrac{\partial J}{\partial O_{t}} V^{T} & \dfrac{\partial J}{\partial S_{t-1}}=\dfrac{\partial J}{\partial O_{t-1}} V^{T}+\dfrac{\partial J}{\partial h_{t}} W^{T} \\ \dfrac{\partial J}{\partial h_{t}}=\dfrac{\partial J}{\partial S_{t}} \dfrac{d S_{t}}{d h_{t}} & \dfrac{\partial J}{\partial h_{t-1}}=\dfrac{\partial J}{\partial S_{t-1}} \dfrac{d S_{t-1}}{d h_{t-1}} \\ \dfrac{\partial J}{\partial x_{t}}=\dfrac{\partial J}{\partial h_{t}} U^{T} & \dfrac{\partial J}{\partial x_{t-1}}=\dfrac{\partial J}{\partial h_{t-1}} U^{T} \end{array}$$
and so on, all the way back to the first two time steps:
$$\begin{array}{ll} \dfrac{\partial J}{\partial S_{2}}=\dfrac{\partial J}{\partial O_{2}} V^{T}+\dfrac{\partial J}{\partial h_{3}} W^{T} & \dfrac{\partial J}{\partial S_{1}}=\dfrac{\partial J}{\partial O_{1}} V^{T}+\dfrac{\partial J}{\partial h_{2}} W^{T} \\ \dfrac{\partial J}{\partial h_{2}}=\dfrac{\partial J}{\partial S_{2}} \dfrac{d S_{2}}{d h_{2}} & \dfrac{\partial J}{\partial h_{1}}=\dfrac{\partial J}{\partial S_{1}} \dfrac{d S_{1}}{d h_{1}} \\ \dfrac{\partial J}{\partial x_{2}}=\dfrac{\partial J}{\partial h_{2}} U^{T} & \dfrac{\partial J}{\partial x_{1}}=\dfrac{\partial J}{\partial h_{1}} U^{T} \end{array}$$
Note here that $\dfrac{dS_t}{dh_t}=S_{t}(1-S_{t})$ for a sigmoid activation, or $1-S_{t}^{2}$ for tanh.

The next step is to differentiate with respect to the parameters. An RNN has three parameter matrices. Consider $V$ first: since the RNN has multiple outputs $O_i$, for $J(O_1, O_2, \cdots, O_t)$ we have
$$\begin{array}{l} \dfrac{\partial J_{t}}{\partial V}=S_{t}^{T} \dfrac{\partial J}{\partial O_{t}} \\ \dfrac{\partial J_{t-1}}{\partial V}=S_{t-1}^{T}\dfrac{\partial J}{\partial O_{t-1}} \\ \vdots \\ \dfrac{\partial J_{1}}{\partial V}=S_{1}^{T} \dfrac{\partial J}{\partial O_{1}} \end{array}$$
Summing the contributions from all time steps (the same parameter matrix appears at every one of them):
$$\frac{\partial J}{\partial V}=\sum_{i=1}^{t} S_{i}^{T} \frac{\partial J}{\partial O_{i}}$$
Similarly, for $U$ and $W$ we have
$$\frac{\partial J}{\partial U}=\sum_{i=1}^{t} x_{i}^{T} \frac{\partial J}{\partial h_{i}}, \qquad \frac{\partial J}{\partial W}=\sum_{i=1}^{t-1} S_{i}^{T} \frac{\partial J}{\partial h_{i+1}}$$
Note from the graph that $W$ sits between consecutive time steps, which is why its sum has only $t-1$ terms. The graph also suggests a useful rule of thumb. When differentiating with respect to an input or output vector, count its outgoing arrows: $S_{t-1}$, for example, has two outgoing arrows, pointing to $O_{t-1}$ and $h_t$, so we take partial derivatives through both of those vectors, whereas $h_{t-1}$ and $S_t$ each have only one outgoing arrow. When differentiating with respect to a parameter, look at the two vectors its arrow connects: $W$, for example, connects $S_{i}$ and $h_{i+1}$, so we first differentiate with respect to the vector at the arrow's head and then multiply by (the transpose of) the vector at its tail.

Stacking these sums, we have
$$\frac{\partial J}{\partial V}=\sum_{i=1}^{t} S_{i}^{T} \frac{\partial J}{\partial O_{i}}=\left(S_{1}^{T}, S_{2}^{T}, \ldots, S_{t}^{T}\right)\left(\begin{array}{c} \frac{\partial J}{\partial O_{1}} \\ \vdots \\ \frac{\partial J}{\partial O_{t}} \end{array}\right)$$

$$\frac{\partial J}{\partial W}=\sum_{i=1}^{t-1} S_{i}^{T} \frac{\partial J}{\partial h_{i+1}}=\left(S_{1}^{T}, S_{2}^{T}, \ldots, S_{t-1}^{T}\right)\left(\begin{array}{c} \frac{\partial J}{\partial h_{2}} \\ \vdots \\ \frac{\partial J}{\partial h_{t}} \end{array}\right)$$

$$\frac{\partial J}{\partial U}=\sum_{i=1}^{t} x_{i}^{T} \frac{\partial J}{\partial h_{i}}=\left(x_{1}^{T}, x_{2}^{T}, \ldots, x_{t}^{T}\right)\left(\begin{array}{c} \frac{\partial J}{\partial h_{1}} \\ \vdots \\ \frac{\partial J}{\partial h_{t}} \end{array}\right)$$
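Continuing the NumPy sketch from the forward-pass section, the backward pass below walks through time in reverse and accumulates exactly these sums. It is only a sketch: the upstream gradients `dJ_dO` are assumed to be given (filled with random placeholders here).

```python
# Backward pass through time, reusing x, h, S, O, U, W, V, b1, b2 from the forward sketch.
dJ_dO = rng.standard_normal(O.shape)        # assumed given: dJ/dO_t for every time step

dU = np.zeros_like(U)
dW = np.zeros_like(W)
dV = np.zeros_like(V)
db1 = np.zeros_like(b1)
db2 = np.zeros_like(b2)

dJ_dh_next = np.zeros(d_h)                  # dJ/dh_{t+1}; zero beyond the last time step
for t in reversed(range(T)):
    # dJ/dS_t = dJ/dO_t V^T (+ dJ/dh_{t+1} W^T for every step except the last)
    dJ_dS = dJ_dO[t] @ V.T + dJ_dh_next @ W.T
    # dJ/dh_t = dJ/dS_t * dS_t/dh_t, with dS/dh = 1 - S^2 for tanh
    dJ_dh = dJ_dS * (1.0 - S[t] ** 2)

    dV += np.outer(S[t], dJ_dO[t])          # dJ/dV += S_t^T dJ/dO_t
    db2 += dJ_dO[t]
    dU += np.outer(x[t], dJ_dh)             # dJ/dU += x_t^T dJ/dh_t
    db1 += dJ_dh
    if t > 0:
        dW += np.outer(S[t - 1], dJ_dh)     # dJ/dW += S_{t-1}^T dJ/dh_t; only t-1 terms

    dJ_dh_next = dJ_dh                      # hand dJ/dh_t back to time step t-1
```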

Here we run into a problem. Only the parameter gradients are matrix-matrix products that can be parallelized; the forward pass and the gradients with respect to the input and output vectors are all vector-matrix products, which cannot be turned into matrix-matrix products. On a GPU, an RNN processing a single sequence therefore cannot make use of batched matrix multiplication and cannot exploit the hardware's full performance.

For example, in the forward pass $S_2 = f(h_2)$: $S_2$ depends on $h_2$, $h_2$ depends on $S_1$, and $S_1$ depends on $h_1$. So when $x_2$ arrives, $S_2$ and $S_1$ cannot be computed at the same time; there is no parallelism across time steps.

Similarly, in the backward pass $h_{t-1}$ depends on $S_{t-1}$ and $S_{t-1}$ depends on $h_t$, so $h_{t-1}$ and $h_t$ cannot be computed simultaneously either.

Since we cannot parallelize within a single sentence, we instead try to train multiple sentences in parallel. Let $x_1^{N}, x_2^{N}, x_3^{N}, \cdots, x_t^{N}$ be the $t$ sample vectors of the $N$-th sentence (all sentences padded with zeros to the length of the longest one). Then, for the first word of each of the $N$ sentences, we have
$$\begin{array}{l} h_{1}^{1}=x_{1}^{1} U+b_{1} \\ S_{1}^{1}=f\left(h_{1}^{1}\right) \\ O_{1}^{1}=S_{1}^{1} V+b_{2} \\ h_{1}^{2}=x_{1}^{2} U+b_{1} \\ S_{1}^{2}=f\left(h_{1}^{2}\right) \\ O_{1}^{2}=S_{1}^{2} V+b_{2} \\ \ldots \\ h_{1}^{N}=x_{1}^{N} U+b_{1} \\ S_{1}^{N}=f\left(h_{1}^{N}\right) \\ O_{1}^{N}=S_{1}^{N} V+b_{2} \end{array}$$
and therefore
$$\begin{array}{c} \left(\begin{array}{c} h_{1}^{1} \\ \vdots \\ h_{1}^{N} \end{array}\right)=\left(\begin{array}{c} x_{1}^{1} \\ \vdots \\ x_{1}^{N} \end{array}\right) U+\left(\begin{array}{c} b_{1} \\ \vdots \\ b_{1} \end{array}\right) \\ \left(\begin{array}{c} S_{1}^{1} \\ \vdots \\ S_{1}^{N} \end{array}\right)=f\left(\begin{array}{c} h_{1}^{1} \\ \vdots \\ h_{1}^{N} \end{array}\right) \\ \left(\begin{array}{c} O_{1}^{1} \\ \vdots \\ O_{1}^{N} \end{array}\right)=\left(\begin{array}{c} S_{1}^{1} \\ \vdots \\ S_{1}^{N} \end{array}\right) V+\left(\begin{array}{c} b_{2} \\ \vdots \\ b_{2} \end{array}\right) \end{array}$$
t − 1 t-1 t1时刻, 有
$$\begin{array}{l} h_{t-1}^{1}=x_{t-1}^{1} U+S_{t-2}^{1} W+b_{1} \\ S_{t-1}^{1}=f\left(h_{t-1}^{1}\right) \\ O_{t-1}^{1}=S_{t-1}^{1} V+b_{2} \\ h_{t-1}^{2}=x_{t-1}^{2} U+S_{t-2}^{2} W+b_{1} \\ S_{t-1}^{2}=f\left(h_{t-1}^{2}\right) \\ O_{t-1}^{2}=S_{t-1}^{2} V+b_{2} \\ \ldots \\ h_{t-1}^{N}=x_{t-1}^{N} U+S_{t-2}^{N} W+b_{1} \\ S_{t-1}^{N}=f\left(h_{t-1}^{N}\right) \\ O_{t-1}^{N}=S_{t-1}^{N} V+b_{2} \end{array}$$
and therefore
$$\begin{array}{c} \left(\begin{array}{c} h_{t-1}^{1} \\ \vdots \\ h_{t-1}^{N} \end{array}\right)=\left(\begin{array}{c} x_{t-1}^{1} \\ \vdots \\ x_{t-1}^{N} \end{array}\right) U+\left(\begin{array}{c} S_{t-2}^{1} \\ \vdots \\ S_{t-2}^{N} \end{array}\right) W+\left(\begin{array}{c} b_{1} \\ \vdots \\ b_{1} \end{array}\right) \\ \left(\begin{array}{c} S_{t-1}^{1} \\ \vdots \\ S_{t-1}^{N} \end{array}\right)=f\left(\begin{array}{c} h_{t-1}^{1} \\ \vdots \\ h_{t-1}^{N} \end{array}\right) \\ \left(\begin{array}{c} O_{t-1}^{1} \\ \vdots \\ O_{t-1}^{N} \end{array}\right)=\left(\begin{array}{c} S_{t-1}^{1} \\ \vdots \\ S_{t-1}^{N} \end{array}\right) V+\left(\begin{array}{c} b_{2} \\ \vdots \\ b_{2} \end{array}\right) \end{array}$$
In this way, the computation for the words (or sample vectors) that occupy the same time step in different sentences becomes a matrix operation.
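A minimal NumPy sketch of this batched forward pass: the N sentences are stacked along the first axis, so each time step becomes a single matrix-matrix product. The batch size N and the other shapes here are illustrative, and shorter sentences are assumed to be zero-padded as described above.

```python
import numpy as np

# N sentences, each padded to T time steps; shapes follow the earlier example.
N, T, d_in, d_h, d_out = 8, 100, 160, 1000, 6000
rng = np.random.default_rng(2)

X = rng.standard_normal((N, T, d_in))       # X[n, t] is x_t of sentence n (zero-padded)
U = rng.standard_normal((d_in, d_h)) * 0.01
W = rng.standard_normal((d_h, d_h)) * 0.01
V = rng.standard_normal((d_h, d_out)) * 0.01
b1 = np.zeros(d_h)
b2 = np.zeros(d_out)

S = np.zeros((N, T, d_h))                   # hidden states for all sentences
O = np.zeros((N, T, d_out))                 # outputs for all sentences
S_prev = np.zeros((N, d_h))                 # one hidden state per sentence, S_0 = 0

for t in range(T):
    # All N sentences at time step t are handled by one matrix product.
    h_t = X[:, t, :] @ U + S_prev @ W + b1  # shape (N, d_h)
    S[:, t, :] = np.tanh(h_t)
    O[:, t, :] = S[:, t, :] @ V + b2        # shape (N, d_out)
    S_prev = S[:, t, :]
```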

Similarly, for the gradients with respect to the input and output vectors we have
$$\begin{aligned} \frac{\partial J}{\partial S_{t-1}^{1}} &=\frac{\partial J}{\partial O_{t-1}^{1}} V^{T}+\frac{\partial J}{\partial h_{t}^{1}} W^{T} \\ \frac{\partial J}{\partial h_{t-1}^{1}} &=\frac{\partial J}{\partial S_{t-1}^{1}} \frac{\partial S_{t-1}^{1}}{\partial h_{t-1}^{1}} \\ \frac{\partial J}{\partial x_{t-1}^{1}} &=\frac{\partial J}{\partial h_{t-1}^{1}} U^{T} \\ & \vdots \\ \frac{\partial J}{\partial S_{t-1}^{N}} &=\frac{\partial J}{\partial O_{t-1}^{N}} V^{T}+\frac{\partial J}{\partial h_{t}^{N}} W^{T} \\ \frac{\partial J}{\partial h_{t-1}^{N}} &=\frac{\partial J}{\partial S_{t-1}^{N}} \frac{\partial S_{t-1}^{N}}{\partial h_{t-1}^{N}} \\ \frac{\partial J}{\partial x_{t-1}^{N}} &=\frac{\partial J}{\partial h_{t-1}^{N}} U^{T} \end{aligned}$$
and therefore
$$\left(\begin{array}{c} \dfrac{\partial J}{\partial S_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial S_{t-1}^{N}} \end{array}\right)=\left(\begin{array}{c} \dfrac{\partial J}{\partial O_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial O_{t-1}^{N}} \end{array}\right) V^{T}+\left(\begin{array}{c} \dfrac{\partial J}{\partial h_{t}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial h_{t}^{N}} \end{array}\right) W^{T}$$

$$\left(\begin{array}{c} \dfrac{\partial J}{\partial h_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial h_{t-1}^{N}} \end{array}\right)=\left(\begin{array}{c} \dfrac{\partial J}{\partial S_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial S_{t-1}^{N}} \end{array}\right) \odot\left(\begin{array}{c} \dfrac{\partial S_{t-1}^{1}}{\partial h_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial S_{t-1}^{N}}{\partial h_{t-1}^{N}} \end{array}\right)$$

$$\left(\begin{array}{c} \dfrac{\partial J}{\partial x_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial x_{t-1}^{N}} \end{array}\right)=\left(\begin{array}{c} \dfrac{\partial J}{\partial h_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial h_{t-1}^{N}} \end{array}\right) U^{T}$$

In this way, the gradients with respect to the input and output vectors can also be computed in parallel.
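Matching the stacked gradient formulas above, here is a sketch of one batched backward step at time step t-1, reusing the arrays from the batched forward sketch. The incoming gradients `dJ_dO_prev` and `dJ_dh_next` are assumed to be given and are filled with placeholders here.

```python
# One backward step at time t-1, reusing X, S, U, W, V and the shapes defined above.
idx = T - 2                                      # 0-based index of time step t-1
dJ_dO_prev = rng.standard_normal((N, d_out))     # assumed given: dJ/dO_{t-1}, one row per sentence
dJ_dh_next = rng.standard_normal((N, d_h))       # assumed given: dJ/dh_t from the step to the right

S_tm1 = S[:, idx, :]                             # S_{t-1} for all N sentences
dJ_dS_tm1 = dJ_dO_prev @ V.T + dJ_dh_next @ W.T  # stacked dJ/dS_{t-1}, shape (N, d_h)
dJ_dh_tm1 = dJ_dS_tm1 * (1.0 - S_tm1 ** 2)       # elementwise product with dS/dh (tanh)
dJ_dx_tm1 = dJ_dh_tm1 @ U.T                      # stacked dJ/dx_{t-1}, shape (N, d_in)
```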

Here is a simple PyTorch implementation of forward and backward propagation for an RNN model:

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        # One linear layer maps [input, hidden] to the new hidden state,
        # another maps the same concatenation to the output.
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

# Hyperparameters
input_size = 10
hidden_size = 20
output_size = 5
seq_len = 3

# Build the RNN model
rnn = SimpleRNN(input_size, hidden_size, output_size)

# Input sequence and a class label for each time step
input_data = torch.randn(seq_len, 1, input_size)
target_data = torch.tensor([[1], [2], [3]])

# Initialize the hidden state
hidden = rnn.initHidden()

# Forward propagation, accumulating the loss over all time steps
loss_func = nn.NLLLoss()
loss = 0.0
for i in range(seq_len):
    output, hidden = rnn(input_data[i], hidden)
    loss = loss + loss_func(output, target_data[i])

# Backward propagation
rnn.zero_grad()
loss.backward()
```

In the code above we first define a simple RNN model consisting of an input layer, a hidden layer, and an output layer. In the forward function, the input and the previous hidden state are concatenated and passed through linear transformations and a LogSoftmax to produce the output and the new hidden state. For backpropagation, we accumulate the per-step losses into a single loss, call `rnn.zero_grad()` to clear all parameter gradients, and then call `loss.backward()` to compute the gradients of all parameters.
