RNN Training Algorithms: Part One


Problem Statement

Consider the recurrent network model
$$x(k) = f[Wx(k-1)] \tag{1}$$
where $x(k) \in R^N$ is the vector of network node states, $W \in R^{N\times N}$ is the matrix of connection weights between nodes, and the output nodes of the network are $\{x_i(k) \mid i \in O\}$, with $O$ the index set of all output (or "observed") units.

The goal of training is to reduce the error between the observed states and their target values, i.e. to minimize the loss function
$$E = \frac{1}{2}\sum_{k=1}^K \sum_{i\in O} [x_i(k) - d_i(k)]^2 \tag{2}$$
where $d_i(k)$ is the target value of node $i$ at time $k$.

$W$ is updated by gradient descent:
$$W_+ = W - \eta \frac{dE}{dW}$$
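As a concrete reference point for the formulas above, here is a minimal NumPy sketch of the dynamics (1) and the loss (2). The function names (`f`, `run_rnn`, `loss`) and the choice of `tanh` as the nonlinearity are assumptions made only for this example; the derivation itself just requires $f$ to be differentiable.

```python
import numpy as np

def f(z):
    # Assumed nonlinearity for illustration; any differentiable f works in the derivation.
    return np.tanh(z)

def run_rnn(W, x0, K):
    """Iterate x(k) = f(W x(k-1)) for k = 1..K; returns [x(0), x(1), ..., x(K)]."""
    xs = [x0]
    for _ in range(K):
        xs.append(f(W @ xs[-1]))
    return xs

def loss(xs, d, out_idx):
    """E = 1/2 * sum_k sum_{i in O} (x_i(k) - d_i(k))^2; d[k-1] holds the targets at time k."""
    E = 0.0
    for k in range(1, len(xs)):
        err = xs[k][out_idx] - d[k - 1]
        E += 0.5 * np.sum(err ** 2)
    return E

# A tiny instance reused by the later sketches (sizes are hypothetical).
rng = np.random.default_rng(0)
N, K = 3, 4
out_idx = np.array([0])                     # O = {0}: a single observed node
W = 0.5 * rng.standard_normal((N, N))
x0 = rng.standard_normal(N)
d = rng.standard_normal((K, out_idx.size))  # d[k-1] = targets d_i(k) for i in O
xs = run_rnn(W, x0, K)
print(loss(xs, d, out_idx))
```

Keeping the whole trajectory `xs` around is deliberate: all of the Jacobian blocks derived below are evaluated along this trajectory.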

Notation

$$W \equiv \begin{bmatrix} \text{-----}\ w_1^T\ \text{-----} \\ \vdots \\ \text{-----}\ w_N^T\ \text{-----} \end{bmatrix}_{N\times N}$$
Flatten the matrix $W$ into a column vector, denoted $w$:
$$w = [w_1^T, \cdots, w_N^T]^T \in R^{N^2}$$
Stack the states at all time steps into a column vector, denoted $x$:
$$x = [x^T(1), \cdots, x^T(K)]^T \in R^{NK}$$
Viewing RNN training as a constrained optimization problem, equation (1) becomes the constraint
$$g(k) \equiv f[Wx(k-1)] - x(k) = 0, \quad k = 1, \ldots, K \tag{3}$$
and the residuals are stacked as
$$g = [g^T(1), \ldots, g^T(K)]^T \in R^{NK}$$
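These stacking conventions are easy to get wrong in code, so a short sketch may help; `vec_W`, `stack_states`, and `residual_g` are hypothetical helper names, reusing `f`, `run_rnn`, `W`, `x0`, and `xs` from the previous sketch. Along a trajectory generated by (1), the stacked residual $g$ is identically zero.

```python
def vec_W(W):
    # w = [w_1^T, ..., w_N^T]^T: stack the rows of W into a length-N^2 vector (row-major flatten).
    return W.reshape(-1)

def stack_states(xs):
    # x = [x^T(1), ..., x^T(K)]^T in R^{NK}; the initial state x(0) is not a decision variable.
    return np.concatenate(xs[1:])

def residual_g(W, xs):
    # g(k) = f(W x(k-1)) - x(k), stacked over k = 1..K.
    return np.concatenate([f(W @ xs[k - 1]) - xs[k] for k in range(1, len(xs))])

print(np.max(np.abs(residual_g(W, xs))))   # ~0: the constraint (3) holds along the trajectory
```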

Main Derivation

Since $x$ and $w$ are coupled through the constraint (3), $x$ can be regarded as a function of $w$, written $x(w)$.

Accordingly, $E(x) \to E(x(w))$ and $g(x,w) \to g(x(w),w)$.

Because $g \equiv 0$,
$$0 = \frac{dg(x(w),w)}{dw} = \frac{\partial g(x(w),w)}{\partial x}\frac{\partial x(w)}{\partial w} + \frac{\partial g(x(w),w)}{\partial w} \tag{4}$$

Solving (4) for $\partial x(w)/\partial w$ and substituting into the chain rule for $E$ gives
$$\begin{aligned} \frac{dE(x(w))}{dw} &= \frac{\partial E(x(w))}{\partial x}\frac{\partial x(w)}{\partial w} \\ &= -\frac{\partial E(x(w))}{\partial x}\left(\frac{\partial g(x(w),w)}{\partial x}\right)^{-1} \frac{\partial g(x(w),w)}{\partial w} \end{aligned}$$
which is abbreviated as
$$\frac{dE}{dw} = -\frac{\partial E}{\partial x}\left(\frac{\partial g}{\partial x}\right)^{-1} \frac{\partial g}{\partial w} \tag{5}$$
Most gradient-descent training methods for recurrent neural networks are organized around equation (5).

First, the dimensions of each term must be clear:
$$\begin{aligned} E &\in R \\ g &\in R^{NK} \\ x &\in R^{NK} \\ w &\in R^{N^2} \\ \frac{\partial E}{\partial x} &\in R^{1\times NK} \\ \frac{\partial g}{\partial x} &\in R^{NK\times NK} \\ \frac{\partial g}{\partial w} &\in R^{NK \times N^2} \end{aligned}$$
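As a quick sanity check of equation (5) and of these dimensions, consider the scalar case $N = 1$, $K = 1$ with the single node observed (this worked example is added here for illustration and is not part of the original derivation):
$$\frac{\partial E}{\partial x} = x(1) - d(1), \qquad \frac{\partial g}{\partial x} = -1, \qquad \frac{\partial g}{\partial w} = f'(w\,x(0))\,x(0),$$
so (5) yields
$$\frac{dE}{dw} = -[x(1) - d(1)]\,(-1)^{-1}\,f'(w\,x(0))\,x(0) = [x(1) - d(1)]\,f'(w\,x(0))\,x(0),$$
which agrees with differentiating $E = \tfrac{1}{2}[f(w\,x(0)) - d(1)]^2$ directly.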

Next, consider how each term is computed:
1.
$$\begin{aligned} \frac{\partial E}{\partial x} &= [e^T(1), \ldots, e^T(K)] \\ e_i(k) &= \begin{cases} x_i(k) - d_i(k), &\text{if } i\in O, \\ 0, &\text{otherwise,} \end{cases} \quad k = 1,\ldots,K. \end{aligned}$$
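A direct construction of this row vector, with `dE_dx` a hypothetical helper name reusing `xs`, `d`, and `out_idx` from the earlier sketches:

```python
def dE_dx(xs, d, out_idx):
    """Return dE/dx as a length-NK row vector: x_i(k) - d_i(k) for i in O, zero elsewhere."""
    N = xs[0].shape[0]
    K = len(xs) - 1
    e = np.zeros(N * K)
    for k in range(1, K + 1):
        block = np.zeros(N)
        block[out_idx] = xs[k][out_idx] - d[k - 1]
        e[(k - 1) * N : k * N] = block
    return e
```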
2.
$$\frac{\partial g}{\partial x} = \begin{bmatrix} \frac{\partial g(1)}{\partial x}\\ \vdots \\ \frac{\partial g(K)}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{\partial g(1)}{\partial x(1)} & \ldots & \frac{\partial g(1)}{\partial x(K)}\\ \vdots & \ddots & \vdots\\ \frac{\partial g(K)}{\partial x(1)} & \ldots & \frac{\partial g(K)}{\partial x(K)} \end{bmatrix}$$
From equation (3),
$$\frac{\partial g(i)}{\partial x(j)} = \begin{cases} -I, &\text{if } i=j, \\ \frac{\partial f[Wx(j)]}{\partial x(j)}, &\text{if } i=j+1, \\ 0, &\text{otherwise,} \end{cases}$$
where
$$\begin{aligned} \frac{\partial f[Wx(j)]}{\partial x(j)} & = \begin{bmatrix} \frac{\partial f(w_1^Tx(j))}{\partial x_1(j)} & \ldots & \frac{\partial f(w_1^Tx(j))}{\partial x_N(j)}\\ \vdots & \ddots & \vdots\\ \frac{\partial f(w_N^Tx(j))}{\partial x_1(j)}& \ldots & \frac{\partial f(w_N^Tx(j))}{\partial x_N(j)} \end{bmatrix}\\ & = \begin{bmatrix} f'(w_1^Tx(j))w_{11} & \ldots & f'(w_1^Tx(j))w_{1N}\\ \vdots & \ddots & \vdots\\ f'(w_N^Tx(j))w_{N1} & \ldots & f'(w_N^Tx(j))w_{NN} \end{bmatrix}\\ &= \begin{bmatrix} f'(w_1^Tx(j)) & &0\\ & \ddots & \\ 0& & f'(w_N^Tx(j)) \end{bmatrix}W \\ &\triangleq D(j)W \end{aligned}$$
Putting this together,
$$\frac{\partial g}{\partial x} = \begin{bmatrix} -I & 0 & 0 &\ldots & 0\\ D(1)W & -I & 0 &\ldots & 0 \\ 0 & D(2)W & -I & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & D(K-1)W & -I \end{bmatrix}_{NK\times NK}$$
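The sketch below assembles $D(j)$ and the block lower-bidiagonal matrix $\partial g/\partial x$ explicitly, assuming $f = \tanh$ so that $f'(z) = 1 - \tanh^2(z)$; `f_prime`, `D_of`, and `dg_dx` are hypothetical names. Materializing the full $NK \times NK$ matrix is only sensible for small examples; practical algorithms exploit its block bidiagonal structure instead.

```python
def f_prime(z):
    # Derivative of the assumed nonlinearity tanh.
    return 1.0 - np.tanh(z) ** 2

def D_of(W, x_j):
    # D(j) = diag(f'(w_i^T x(j))): diagonal of f' evaluated at the pre-activations W x(j).
    return np.diag(f_prime(W @ x_j))

def dg_dx(W, xs):
    """dg/dx in R^{NK x NK}: -I on the diagonal blocks, D(k-1) W on the first subdiagonal."""
    N = xs[0].shape[0]
    K = len(xs) - 1
    J = np.zeros((N * K, N * K))
    for k in range(1, K + 1):
        r = (k - 1) * N
        J[r:r + N, r:r + N] = -np.eye(N)
        if k >= 2:
            c = (k - 2) * N                      # column block corresponding to x(k-1)
            J[r:r + N, c:c + N] = D_of(W, xs[k - 1]) @ W
    return J
```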

3.
$$\frac{\partial g}{\partial w} = \begin{bmatrix} \frac{\partial g(1)}{\partial w}\\ \vdots \\ \frac{\partial g(K)}{\partial w} \end{bmatrix}= \begin{bmatrix} \frac{\partial f[Wx(0)]}{\partial w}\\ \vdots \\ \frac{\partial f[Wx(K-1)]}{\partial w} \end{bmatrix}$$
where
$$\begin{aligned} \frac{\partial f[Wx(k)]}{\partial w} &= \begin{bmatrix} \frac{\partial f[w_1^Tx(k)]}{\partial w}\\ \vdots \\ \frac{\partial f[w_N^Tx(k)]}{\partial w} \end{bmatrix}_{N\times N^2} \\ &= \begin{bmatrix} f'[w_1^Tx(k)]x_{1}(k) & \ldots & f'[w_1^Tx(k)]x_{N}(k) & 0 & \ldots & 0 & \ldots \\ 0 & \ldots & 0 & f'[w_2^Tx(k)]x_{1}(k) & \ldots & f'[w_2^Tx(k)]x_{N}(k) & \ldots \\ \vdots \end{bmatrix} \\ &= \begin{bmatrix} f'[w_1^Tx(k)] &&& \\ & f'[w_2^Tx(k)] && \\ && \ddots & \\ &&& f'[w_N^Tx(k)] \end{bmatrix}_{N\times N} \begin{bmatrix} x^T(k) &&& \\ & x^T(k) && \\ && \ddots & \\ &&& x^T(k) \end{bmatrix}_{N\times N^2} \\ &\triangleq D(k) X(k) \end{aligned}$$
with
$$X(k) \triangleq \begin{bmatrix} x^T(k) &&& \\ & x^T(k) && \\ && \ddots & \\ &&& x^T(k) \end{bmatrix}_{N\times N^2}$$
so that
$$\frac{\partial g}{\partial w} = \begin{bmatrix} D(0)X(0)\\ D(1)X(1) \\ \vdots \\ D(K-1)X(K-1) \end{bmatrix}$$
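Finally, a sketch that builds $X(k)$ and $\partial g/\partial w$, evaluates the gradient from equation (5) (including the minus sign carried through the derivation), and checks it against a central finite-difference approximation; `X_of`, `dg_dw`, `grad_E`, and `numerical_grad` are hypothetical names, and the helpers from the earlier sketches are reused.

```python
def X_of(x_k):
    # X(k): N x N^2 block-diagonal matrix with x^T(k) repeated N times along the diagonal.
    N = x_k.shape[0]
    return np.kron(np.eye(N), x_k.reshape(1, -1))

def dg_dw(W, xs):
    """Stack D(k-1) X(k-1) for k = 1..K into an NK x N^2 matrix."""
    return np.vstack([D_of(W, xs[k - 1]) @ X_of(xs[k - 1]) for k in range(1, len(xs))])

def grad_E(W, x0, d, out_idx, K):
    # Equation (5): dE/dw = - (dE/dx) (dg/dx)^{-1} (dg/dw), evaluated along the trajectory.
    xs = run_rnn(W, x0, K)
    return -dE_dx(xs, d, out_idx) @ np.linalg.solve(dg_dx(W, xs), dg_dw(W, xs))

def numerical_grad(W, x0, d, out_idx, K, eps=1e-6):
    # Central-difference approximation of dE/dw, one weight at a time.
    g = np.zeros(W.size)
    for idx in range(W.size):
        Wp = W.reshape(-1).copy()
        Wm = W.reshape(-1).copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        Ep = loss(run_rnn(Wp.reshape(W.shape), x0, K), d, out_idx)
        Em = loss(run_rnn(Wm.reshape(W.shape), x0, K), d, out_idx)
        g[idx] = (Ep - Em) / (2 * eps)
    return g

print(np.max(np.abs(grad_E(W, x0, d, out_idx, K) - numerical_grad(W, x0, d, out_idx, K))))
```

On the small random instance from the first sketch, the two gradients should agree to roughly the finite-difference accuracy, which makes this a convenient unit test when implementing any of the algorithms built on (5).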

References

  • A. F. Atiya and A. G. Parlos, "New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence," IEEE Transactions on Neural Networks, 2000.
