Notes on Backpropagation Through Time (BPTT)

1. Backpropagation (BP)

Take the expression $f(w,x)=\frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ as an example. The elementary operations it involves and their derivative formulas are:

$$\begin{aligned}&f(x)=\frac{1}{x} &&\rightarrow\quad \frac{df}{dx}=-\frac{1}{x^2}\\&f_c(x)=c+x &&\rightarrow\quad \frac{df}{dx}=1\\&f(x)=e^x &&\rightarrow\quad \frac{df}{dx}=e^x\\&f_a(x)=ax &&\rightarrow\quad \frac{df}{dx}=a\end{aligned}\tag{1}$$
The backward pass of the expression $f(w,x)$ is shown in the figure below:

[Figure: backward pass through the computational graph of $f(w,x)$]

In the figure, the green numbers are the forward-pass values of $f(w,x)$ and the red numbers are the back-propagated gradients. For a node with a single input (e.g. adding a constant or taking an exponential), the gradient is back-propagated as:

$$g_k=g_{k+1}\cdot\left.\frac{df_k}{dx}\right|_{x=v_k}\tag{2}$$
where $g_k$ is the gradient at the node's input, $g_{k+1}$ is the gradient at the node's output, $f_k$ is the node's function, and $v_k$ is the node's input value. An addition node passes the gradient through unchanged along every branch. A multiplication node back-propagates the gradient as:

$$g_k=g_{k+1}\cdot a\tag{3}$$
where $a$ is the value on the node's other input branch.
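
To make formulas (1)–(3) concrete, here is a minimal Python sketch (not from the original note) that runs the forward and backward pass of this circuit node by node; the numeric input values are arbitrary and chosen only for illustration.

```python
import math

# Forward pass of f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # arbitrary example values

mul0 = w0 * x0          # multiplication node
mul1 = w1 * x1          # multiplication node
s = mul0 + mul1 + w2    # addition nodes
neg = -s                # f_a(x) = a*x with a = -1
exp = math.exp(neg)     # f(x) = e^x
den = 1.0 + exp         # f_c(x) = c + x with c = 1
f = 1.0 / den           # f(x) = 1/x

# Backward pass: apply formulas (2) and (3) node by node, starting from df/df = 1
dden = 1.0 * (-1.0 / den**2)   # 1/x node: derivative -1/x^2
dexp = dden * 1.0              # c + x node: derivative 1
dneg = dexp * math.exp(neg)    # e^x node: derivative e^x
ds = dneg * (-1.0)             # a*x node with a = -1
dw2 = ds                       # addition: gradient passes through unchanged
dmul0 = ds
dmul1 = ds
dw0 = dmul0 * x0               # multiplication: gradient times the other input
dx0 = dmul0 * w0
dw1 = dmul1 * x1
dx1 = dmul1 * w1

print(f, dw0, dx0, dw1, dx1, dw2)
```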

2. Backpropagation Through Time (BPTT)

2.1 RNN Architecture

[Figure: classic RNN architecture]

The classic RNN structure is shown in the figure above. Its forward pass is:

$$\begin{aligned}s_t&=Uh_{t-1}+Wx_t\\h_t&=\tanh(s_t)\\z_t&=Vh_t\\\hat{y}_t&=\operatorname{softmax}(z_t)\\E_t&=-y_t^T\log(\hat{y}_t)\\E&=\sum_{t=1}^{T}E_t\end{aligned}\tag{4}$$
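
As a hedged illustration of Eq. (4), the NumPy sketch below runs the forward pass over a short sequence. The layer sizes, random initialization, and dummy data are assumptions made here, not part of the original note; the later sketches in these notes reuse the variables defined here (`U`, `W`, `V`, `xs`, `ys`, `hs`, `y_hats`).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 8, 3, 5              # assumed sizes, for illustration only

U = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden weights
W = rng.normal(scale=0.1, size=(d_h, d_in))   # input-to-hidden weights
V = rng.normal(scale=0.1, size=(d_out, d_h))  # hidden-to-output weights

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

xs = [rng.normal(size=d_in) for _ in range(T)]               # dummy inputs x_1..x_T
ys = [np.eye(d_out)[rng.integers(d_out)] for _ in range(T)]  # dummy one-hot targets

h_prev = np.zeros(d_h)       # initial hidden state
hs, y_hats, E = [], [], 0.0
for x_t, y_t in zip(xs, ys):
    s_t = U @ h_prev + W @ x_t          # s_t = U h_{t-1} + W x_t
    h_t = np.tanh(s_t)                  # h_t = tanh(s_t)
    z_t = V @ h_t                       # z_t = V h_t
    y_hat = softmax(z_t)                # y_hat_t = softmax(z_t)
    E += -y_t @ np.log(y_hat)           # E_t = -y_t^T log(y_hat_t)
    hs.append(h_t)
    y_hats.append(y_hat)
    h_prev = h_t

print("total loss E =", E)
```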

2.2 Backward Pass

2.2.1 Computing $\frac{\partial E_t}{\partial V}$

Using the standard result for softmax with cross-entropy loss, $\frac{\partial E_t}{\partial z_t}=\hat{y}_t-y_t$, differentiating with respect to a single entry $V_{ij}$ gives:

$$\begin{aligned}\frac{\partial E_t}{\partial V_{ij}}&=\frac{\partial z_t}{\partial V_{ij}}\frac{\partial E_t}{\partial z_t}\\&=\operatorname{tr}\left[\left(\frac{\partial E_t}{\partial z_t}\right)^T\cdot\frac{\partial z_t}{\partial V_{ij}}\right]\\&=\operatorname{tr}\left[(\hat{y}_t-y_t)^T\cdot\begin{bmatrix}0\\\vdots\\\frac{\partial z_t^{(i)}}{\partial V_{ij}}\\\vdots\\0\end{bmatrix}\right]\\&=(\hat{y}_t-y_t)^{(i)}h_t^{(j)}\end{aligned}\tag{5}$$

For the whole matrix $V$, the gradient is therefore:

$$\frac{\partial E_t}{\partial V}=(\hat{y}_t-y_t)\otimes h_t \tag{6}$$
where $\otimes$ denotes the vector outer product.
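
In code, Eq. (6) is one outer product per time step. The fragment below is a sketch continuing the forward-pass example from Section 2.1 (it reuses `V`, `hs`, `ys`, `y_hats` defined there) and accumulates the gradient of the total loss $E$ over all time steps:

```python
# Eq. (6): gradient of E w.r.t. V, summed over time steps,
# continuing the forward-pass sketch above.
dV = np.zeros_like(V)
for y_hat_t, y_t, h_t in zip(y_hats, ys, hs):
    dV += np.outer(y_hat_t - y_t, h_t)   # (y_hat_t - y_t) ⊗ h_t
```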

2.2.2 Computing $\frac{\partial E_t}{\partial U}$

$$\frac{\partial E_t}{\partial U_{ij}}=\sum_{k=0}^t\frac{\partial s_k}{\partial U_{ij}}\frac{\partial E_t}{\partial s_k}=\sum_{k=0}^{t}\operatorname{tr}\left[\left(\frac{\partial E_t}{\partial s_k}\right)^T\frac{\partial s_k}{\partial U_{ij}}\right]=\sum_{k=0}^{t}\operatorname{tr}\left[\delta_k^T\frac{\partial s_k}{\partial U_{ij}}\right]=\sum_{k=0}^t\delta_k^{(i)}h_{k-1}^{(j)}\tag{7}$$

Here $\delta_k\equiv\frac{\partial E_t}{\partial s_k}$. Applying the chain rule to $\delta_k$:

$$\delta_k=\frac{\partial h_k}{\partial s_k}\frac{\partial s_{k+1}}{\partial h_k}\frac{\partial E_t}{\partial s_{k+1}}=\operatorname{diag}(1-h_k\odot h_k)\,U^T\delta_{k+1}=(U^T\delta_{k+1})\odot(1-h_k\odot h_k)\tag{8}$$
Here $\odot$ denotes the element-wise product, and the recursion is anchored at $\delta_t=\frac{\partial E_t}{\partial s_t}=\left(V^T(\hat{y}_t-y_t)\right)\odot(1-h_t\odot h_t)$. For the whole matrix $U$, the gradient is:

$$\frac{\partial E_t}{\partial U}=\sum_{k=0}^{t}\delta_k\otimes h_{k-1}\tag{9}$$
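
A sketch of Eqs. (8)–(9) for a single time step $t$, again continuing the forward-pass example above (it reuses `U`, `V`, `hs`, `ys`, `y_hats`, `d_h`, `T`); the base case $\delta_t$ is the one stated before Eq. (9), and the initial hidden state $h_{-1}$ is assumed to be zero:

```python
# Gradient of a single E_t with respect to U, Eqs. (8)-(9),
# continuing the forward-pass sketch above.
t = T - 1
dU_t = np.zeros_like(U)

# Base case: delta_t = (V^T (y_hat_t - y_t)) ⊙ (1 - h_t ⊙ h_t)
delta = (V.T @ (y_hats[t] - ys[t])) * (1.0 - hs[t] * hs[t])

for k in range(t, -1, -1):
    h_prev = hs[k - 1] if k > 0 else np.zeros(d_h)   # h_{-1} = 0
    dU_t += np.outer(delta, h_prev)                  # Eq. (9): delta_k ⊗ h_{k-1}
    if k > 0:
        # Eq. (8): delta_{k-1} = (U^T delta_k) ⊙ (1 - h_{k-1} ⊙ h_{k-1})
        delta = (U.T @ delta) * (1.0 - hs[k - 1] * hs[k - 1])
```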

2.2.3 Computing $\frac{\partial E_t}{\partial W}$

Following the same reasoning, the gradient with respect to the matrix $W$ is:

$$\frac{\partial E_t}{\partial W}=\sum_{k=0}^{t}\delta_k\otimes x_k\tag{10}$$

2.2.4 Parameter Updates

$$\begin{aligned}V&:=V-\lambda\sum_{t=1}^T(\hat{y}_t-y_t)\otimes h_t\\U&:=U-\lambda\sum_{t=1}^T\sum_{k=0}^t\delta_k\otimes h_{k-1}\\W&:=W-\lambda\sum_{t=1}^T\sum_{k=0}^t\delta_k\otimes x_k\end{aligned}\tag{11}$$
where $\lambda$ is the learning rate.
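
Putting Eqs. (6)–(11) together, the sketch below performs one full BPTT update on the toy model from the forward-pass example (reusing `U`, `V`, `W`, `xs`, `ys`, `hs`, `y_hats`, `d_h`, `T`); the learning-rate value is an arbitrary assumption:

```python
lam = 0.1   # learning rate lambda, arbitrary assumed value

dV = np.zeros_like(V)
dU = np.zeros_like(U)
dW = np.zeros_like(W)

for t in range(T):
    err_t = y_hats[t] - ys[t]
    dV += np.outer(err_t, hs[t])                      # Eq. (6)
    delta = (V.T @ err_t) * (1.0 - hs[t] * hs[t])     # base case delta_t
    for k in range(t, -1, -1):
        h_prev = hs[k - 1] if k > 0 else np.zeros(d_h)
        dU += np.outer(delta, h_prev)                 # Eq. (9)
        dW += np.outer(delta, xs[k])                  # Eq. (10)
        if k > 0:
            # Eq. (8): propagate delta one step back in time
            delta = (U.T @ delta) * (1.0 - hs[k - 1] * hs[k - 1])

# Eq. (11): gradient-descent updates
V -= lam * dV
U -= lam * dU
W -= lam * dW
```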

2.3 The Long-Term Dependency Problem

Re-examining the gradient $\frac{\partial E_t}{\partial W}$:

$$\frac{\partial E_t}{\partial W}=\sum_{k=0}^{t}\frac{\partial E_t}{\partial s_t}\left(\prod_{j=k+1}^t\frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}\tag{12}$$
Because the derivative of $\tanh$ lies in $(0,1]$, the norm of the Jacobian $\frac{\partial s_j}{\partial s_{j-1}}$ is bounded above (by 1 whenever the norm of $U$ is at most 1). Multiplying many such Jacobians together therefore drives the norm of the product down exponentially until it all but vanishes, so the gradient contributed by time steps far before $t$ is essentially zero. Those early states contribute nothing to learning, which is why the vanilla RNN cannot capture long-term dependencies.
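
This decay can be observed numerically. Continuing the toy forward-pass sketch, the fragment below multiplies successive Jacobians $\frac{\partial s_j}{\partial s_{j-1}} = U\,\operatorname{diag}(1-h_{j-1}\odot h_{j-1})$ and prints the spectral norm of the running product; with the small random $U$ assumed earlier, the norm shrinks roughly geometrically with the length of the chain:

```python
# Numerical illustration of the vanishing gradient: the norm of the
# product of Jacobians ds_j/ds_{j-1} = U @ diag(1 - h_{j-1}**2)
# decays rapidly as the time gap grows.
prod = np.eye(d_h)
for j in range(1, T):
    J = U @ np.diag(1.0 - hs[j - 1] ** 2)   # Jacobian at step j
    prod = J @ prod
    print(f"chain length {j}: spectral norm = {np.linalg.norm(prod, 2):.3e}")
```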

