Recurrent Neural Networks (RNN) Explained: Backpropagation Derivation + Code (Very Detailed)


MrTriste


Article link: https://blog.csdn.net/wjc1182511338/article/details/79191099


Parts of this article are adapted from https://zybuluo.com/hanbingtao/note/541458

1. Why RNN

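Recurrent neural networks

An RNN can be used to build a language model: given the first part of a sentence, the model predicts which word is most likely to come next.

In theory, an RNN can look backward (or forward) over arbitrarily many words.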

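2. RNN structure

2.1 The basic structure:

$x_{t-1},x_t,x_{t+1}$ are consecutive words of the input sentence, $o_{t-1},o_t,o_{t+1}$ are the output probabilities for the corresponding words, and $s$ is the hidden-layer neuron (the hidden state).

$U, V, W$ are weight matrices, and $f$, $g$ are activation functions.
$\begin{aligned} o_t&=g(V s_t) \\ s_t&=f(U x_t+W s_{t-1}) \end{aligned}$
After receiving the input $x_t$ at time $t$, the network's hidden-layer value is $s_t$ and its output is $o_t$. The key point is that $s_t$ depends not only on $x_t$ but also on $s_{t-1}$, and therefore on the earlier inputs $x_{t-1}, x_{t-2}, \dots$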

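Unrolled, this gives:
$\begin{aligned} o_t&=g(V s_t) \\ &=g(V f(U x_t+W s_{t-1})) \\ &=g(V f(U x_t+W f(U x_{t-1}+W s_{t-2}))) \\ &=\cdots \end{aligned}$
The same $W$ is used at every unrolled layer (time step), and so is the same $U$.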
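The recurrence described above (a hidden state computed from the current input and the previous hidden state, and an output computed from the hidden state) can be sketched in NumPy as follows. This is a minimal illustration rather than the author's code: the function name rnn_forward, the choices f = tanh and g = softmax (the ones used later in Section 3), the zero initial state, and the random weights are all assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, V, W):
    """Run s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t) over a sequence.

    xs: sequence of input vectors, each of shape (m,)
    U: (n, m), W: (n, n), V: (m, n); the same U, V, W are reused at every step.
    Returns (outputs, states) with one row per time step.
    """
    s_prev = np.zeros(W.shape[0])      # assumed zero initial hidden state
    outputs, states = [], []
    for x_t in xs:
        s_t = np.tanh(U @ x_t + W @ s_prev)
        outputs.append(softmax(V @ s_t))
        states.append(s_t)
        s_prev = s_t
    return np.array(outputs), np.array(states)

rng = np.random.default_rng(0)
m, n, T = 5, 3, 4                      # vocabulary size, hidden size, sequence length
U = rng.normal(scale=0.1, size=(n, m))
V = rng.normal(scale=0.1, size=(m, n))
W = rng.normal(scale=0.1, size=(n, n))
xs = np.eye(m)[rng.integers(0, m, size=T)]   # T one-hot input vectors
o, s = rnn_forward(xs, U, V, W)
print(o.shape, s.shape)                # (4, 5) (4, 3)
```

Each row of o is a probability distribution over the m words, and each row of s is the hidden state at that step.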

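Next, we walk through backpropagation on this structure.

(2.2 Adding a bidirectional recurrence)

-> bidirectional recurrent neural network

The difference is that the output $o_t$ depends not only on the forward-computed neuron (position $A_t$) but also on the backward-computed neuron (position $A_t^{'}$). Writing $s_t$ and $s_t^{'}$ for the hidden states at these two positions:
$\begin{aligned} o_t&=g(V s_t+V^{'} s_t^{'}) \\ s_t&=f(U x_t+W s_{t-1}) \\ s_t^{'}&=f(U^{'} x_t+W^{'} s_{t+1}^{'}) \end{aligned}$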
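As a rough sketch of the bidirectional idea, the code below runs one pass left to right and a second pass right to left, then combines both states in the output. The names birnn_states, U_b, W_b, V_b, the tanh and softmax activations, and the zero boundary states are assumptions made for illustration, not part of the original article.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise numerically stable softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def birnn_states(xs, U, W, U_b, W_b):
    """Return forward states s_t and backward states s'_t for a sequence."""
    T, n = len(xs), W.shape[0]
    s_fwd = np.zeros((T, n))
    s_bwd = np.zeros((T, n))
    prev = np.zeros(n)
    for t in range(T):                       # left to right: uses s_{t-1}
        prev = np.tanh(U @ xs[t] + W @ prev)
        s_fwd[t] = prev
    nxt = np.zeros(n)
    for t in reversed(range(T)):             # right to left: uses s'_{t+1}
        nxt = np.tanh(U_b @ xs[t] + W_b @ nxt)
        s_bwd[t] = nxt
    return s_fwd, s_bwd

rng = np.random.default_rng(0)
m, n = 5, 3
U, U_b = rng.normal(scale=0.5, size=(n, m)), rng.normal(scale=0.5, size=(n, m))
W, W_b = rng.normal(scale=0.5, size=(n, n)), rng.normal(scale=0.5, size=(n, n))
V, V_b = rng.normal(scale=0.5, size=(m, n)), rng.normal(scale=0.5, size=(m, n))

xs = np.eye(m)[[0, 2, 1, 4]]                 # four one-hot inputs
s_fwd, s_bwd = birnn_states(xs, U, W, U_b, W_b)
o = softmax_rows(s_fwd @ V.T + s_bwd @ V_b.T)  # o_t uses both directions

# Replace only the last input and recompute.
xs2 = xs.copy()
xs2[-1] = np.eye(m)[3]
s_fwd2, s_bwd2 = birnn_states(xs2, U, W, U_b, W_b)
print(np.allclose(s_fwd[:-1], s_fwd2[:-1]))  # forward states before the change are unaffected
print(np.abs(s_bwd[0] - s_bwd2[0]).max())    # backward state at t=0 sees the later input
```

The forward states before the modified position do not change, while the backward states carry information from later inputs back to earlier positions.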

(2.3 Adding multiple layers)

(That is, the yellow part goes from one layer of neurons to three.) -> deep recurrent neural network

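3. Training

Backpropagation through time (BPTT)

We apply backpropagation to the basic structure described in 2.1.

3.0 Setup

The network has three parameters, $V, W, U$. The derivations for $W$ and $U$ are very similar, so we mainly derive the gradients for $V$ and $W$ and then briefly explain $U$.

This section draws on Recurrent Neural Networks Tutorial, Part 3 and an accompanying PDF.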

The PDF uses Einstein summation, which is simple: the summation sign is omitted, as follows.
$\frac{\partial E_t}{\partial V_{ij}}=\sum_m \frac{\partial E_t}{\partial O_{t_m}} \frac{\partial O_{t_m}}{\partial V_{ij}}= \frac{\partial E_t}{\partial O_{t_m}} \frac{\partial O_{t_m}}{\partial V_{ij}}$
Here $m$ is a dummy index, so the sum over $m$ can be left implicit; this is Einstein summation.

To keep the derivation easy to follow, we do not use Einstein summation below, although it is admittedly more concise.
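To make the convention concrete, here is a small NumPy sketch in which the repeated index m is summed once with an explicit loop and once implicitly through np.einsum. The arrays a and B are random stand-ins (an assumption for illustration) for $\frac{\partial E_t}{\partial O_{t_m}}$ and $\frac{\partial O_{t_m}}{\partial V_{ij}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)           # stand-in for dE_t/dO_{t_m}, indexed by m
B = rng.normal(size=(4, 2, 3))   # stand-in for dO_{t_m}/dV_{ij}, indexed by m, i, j

# Explicit summation over the dummy index m.
explicit = np.zeros((2, 3))
for m in range(4):
    explicit += a[m] * B[m]

# Einstein summation: m appears in both operands and not in the output,
# so it is summed over without writing the sum.
implicit = np.einsum("m,mij->ij", a, B)

print(np.allclose(explicit, implicit))   # True
```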
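Dimensions of the variables:

$V:m \times n\\ x_t:m \times 1\\ s_t:n \times 1\\ U:n \times m\\ W:n \times n\\ y:m \times 1\quad \text{true label}\\ \hat{y}:m \times 1\quad \text{predicted probabilities}$

The error is:
$E=\sum_t E_t$
We differentiate each $E_t$ separately and then add the results.
The sequence length is $T$, so $t$ runs from $0$ to $T-1$.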

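3.1 Derivative with respect to V

The equations:
$\begin{aligned} E_t&=-\sum_k (y_{t_k} \log \hat{y_{t_k}}) \\ \hat{y_t}&=\mathrm{softmax}(q_t) \\ q_t&=V s_t \\ s_t&=\tanh(U x_t+W s_{t-1}) \end{aligned}$
Differentiating with respect to $V_{ij}$:
$\frac{\partial E_t}{\partial V_{ij}}=\sum_k \sum_l (\frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial V_{ij}}) \quad (*)$
First factor of (*):
$\frac{\partial E_t}{\partial \hat{y_{t_k}}}=-y_{t_k}*\frac{1}{\hat{y_{t_k}}}$
Second factor of (*) (the softmax derivative):
$\frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}}=\begin{cases} \hat{y_{t_k}}(1-\hat{y_{t_k}}) & k=l \\ -\hat{y_{t_k}}\hat{y_{t_l}} & k\neq l \end{cases}$
Combining the first two factors (using $\sum_k y_{t_k}=1$, since $y_t$ is one-hot):
$\begin{aligned} \sum_k \frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}}&=-y_{t_l}(1-\hat{y_{t_l}})+\sum_{k\neq l} y_{t_k}\hat{y_{t_l}} \\ &=-y_{t_l}+\hat{y_{t_l}}\sum_k y_{t_k} \\ &=\hat{y_{t_l}}-y_{t_l} \quad (**) \end{aligned}$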
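The combination of the first two factors can be checked numerically. The sketch below builds the full softmax Jacobian and multiplies it by the derivative of the cross-entropy loss; the random scores q_t and the label position are arbitrary choices made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
q_t = rng.normal(size=5)          # scores q_t (arbitrary values)
y_hat = softmax(q_t)
y = np.eye(5)[3]                  # one-hot true label

# Softmax Jacobian: J[k, l] = d y_hat_k / d q_l
#   k == l: y_hat_k (1 - y_hat_k);  k != l: -y_hat_k y_hat_l
J = np.diag(y_hat) - np.outer(y_hat, y_hat)

dE_dyhat = -y / y_hat             # first factor: -y_k / y_hat_k
combined = dE_dyhat @ J           # sum over k of (first factor) * (second factor)

print(np.allclose(combined, y_hat - y))   # True
```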
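Third factor of (*):
$\frac{\partial q_{t_l}}{\partial V_{ij}}=\frac{\partial (\sum_m V_{lm} s_{t_m})}{\partial V_{ij}}=\delta_{li}\, s_{t_j} \quad (***)$
where $\delta_{li}$ is 1 if $l=i$ and 0 otherwise.
Combining (**) and (***):
$\frac{\partial E_t}{\partial V_{ij}}=\sum_l (\hat{y_{t_l}}-y_{t_l})\,\delta_{li}\, s_{t_j}=(\hat{y_{t_i}}-y_{t_i})\, s_{t_j}$
Therefore, with $\otimes$ denoting the outer product:
$\frac{\partial E_t}{\partial V}=(\hat{y_{t}}-y_t)\otimes s_t$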
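A quick way to gain confidence in this result is a finite-difference gradient check. The sketch below compares the outer-product formula with central differences of the cross-entropy loss; the helper names, the sizes, and the random values are assumptions chosen only for the check.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(V, s_t, y_t):
    """Cross-entropy E_t = -sum_k y_k log(y_hat_k) with y_hat = softmax(V s_t)."""
    return -np.sum(y_t * np.log(softmax(V @ s_t)))

rng = np.random.default_rng(1)
m, n = 5, 3
V = rng.normal(size=(m, n))
s_t = np.tanh(rng.normal(size=n))     # any hidden state in (-1, 1)
y_t = np.eye(m)[2]                    # one-hot label

# Analytic gradient from the derivation: (y_hat - y) outer s_t, shape (m, n).
analytic = np.outer(softmax(V @ s_t) - y_t, s_t)

# Central finite differences, one entry of V at a time.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(m):
    for j in range(n):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, j] += eps
        V_minus[i, j] -= eps
        numeric[i, j] = (loss(V_plus, s_t, y_t) - loss(V_minus, s_t, y_t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```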

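3.2 Derivative with respect to W

The equations:
$\begin{aligned} E_t&=-\sum_k (y_{t_k} \log \hat{y_{t_k}}) \\ \hat{y_t}&=\mathrm{softmax}(q_t) \\ q_t&=V s_t \\ s_t&=\tanh(U x_t+W s_{t-1}) \end{aligned}$
As with $V_{ij}$, differentiating with respect to $W_{ij}$:
$\frac{\partial E_t}{\partial W_{ij}}=\sum_k \sum_l \sum_m(\frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial s_{t_m}} \frac{\partial s_{t_m}}{\partial W_{ij}} ) \quad (*)$

First two factors of (*):
$\sum_k \frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}}=\hat{y_{t_l}}-y_{t_l} \quad (**)$
Third factor of (*):
$\frac{\partial q_{t_l}}{\partial s_{t_m}}=\frac{\partial (\sum_b V_{lb} s_{t_b})}{\partial s_{t_m}}=V_{lm} \quad (***)$
Fourth factor of (*) ($s_{t_m}$ depends on $s_0, \dots, s_{t-1}$, because $s_t=\tanh(Ux_t+Ws_{t-1})$):
$\frac{\partial s_{t_m}}{\partial W_{ij}}=\sum_{r=0}^t \sum_n \frac{\partial s_{t_m}}{\partial s_{r_n}} \frac{\partial s_{r_n}}{\partial W_{ij}} \quad (****)$
Inside the sum, $\frac{\partial s_{r_n}}{\partial W_{ij}}$ is the immediate derivative with $s_{r-1}$ held fixed, and $\frac{\partial s_{t_m}}{\partial s_{t_n}}=\delta_{mn}$. Below, the $r=t$ term is called the first term, the $r=t-1$ term the second term, and so on.
So (*) can be written as:
$\frac{\partial E_t}{\partial W_{ij}}=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m[ V_{lm} \sum_{r=0}^t \sum_n (\frac{\partial s_{t_m}}{\partial s_{r_n}} \frac{\partial s_{r_n}}{\partial W_{ij}})]\}$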

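3.2.0 Code:

The derivation above leads to the following backpropagation code.

Where:
$\begin{aligned} \text{o}&:\hat{y_t}, & \text{s}&:s_t, & \text{delta\_o}&:\hat{y_t}-y_t \\ \text{dLdV}&:\frac{\partial E}{\partial V}, & \text{dLdW}&:\frac{\partial E}{\partial W}, & \text{dLdU}&:\frac{\partial E}{\partial U} \end{aligned}$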

import numpy as np

def bptt(self, x, y):
    # x: input word indices, y: target word indices (one per time step)
    T = len(y)
    # Perform forward propagation.
    # s is assumed to hold one extra row, so that s[-1] is the zero initial state.
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    # delta_o[t] = y_hat_t - y_t; copy so that o itself is not modified in place
    delta_o = o.copy()
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]: # t: (T-1) -> 0
        # dE_t/dV = (y_hat_t - y_t) outer s_t
        dLdV += np.outer(delta_o[t], s[t])
        # Initial delta calculation: dE_t/dz_t, with z_t = U x_t + W s_{t-1}
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]: # bptt_step: t -> ...
            # Add to gradients at each previous step
            dLdW += np.outer(delta_t, s[bptt_step-1])
            # x is one-hot, so only the column of the active word changes
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step: dE_t/dz at bptt_step-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]
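One way to test a routine like this is to compare its output with finite-difference gradients. The sketch below is self-contained, so it repeats a compact version of the same BPTT loop inside a hypothetical TinyRNN class; the class name, its forward_propagation and loss methods, the sizes, and the toy index sequences are all assumptions made for the check, not the author's model.

```python
import numpy as np

class TinyRNN:
    """Hypothetical minimal model exposing the attributes the BPTT loop relies on."""
    def __init__(self, vocab=6, hidden=4, bptt_truncate=10, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.uniform(-0.5, 0.5, (hidden, vocab))
        self.V = rng.uniform(-0.5, 0.5, (vocab, hidden))
        self.W = rng.uniform(-0.5, 0.5, (hidden, hidden))
        self.bptt_truncate = bptt_truncate

    def forward_propagation(self, x):
        T = len(x)
        s = np.zeros((T + 1, self.W.shape[0]))   # s[-1] stays zero: initial state
        o = np.zeros((T, self.V.shape[0]))
        for t in range(T):
            # U[:, x[t]] equals U @ one_hot(x[t])
            s[t] = np.tanh(self.U[:, x[t]] + self.W.dot(s[t - 1]))
            z = self.V.dot(s[t])
            e = np.exp(z - z.max())
            o[t] = e / e.sum()
        return o, s

    def loss(self, x, y):
        """Total cross-entropy E = sum_t E_t."""
        o, _ = self.forward_propagation(x)
        return -np.sum(np.log(o[np.arange(len(y)), y]))

    def bptt(self, x, y):
        T = len(y)
        o, s = self.forward_propagation(x)
        dLdU, dLdV, dLdW = (np.zeros_like(p) for p in (self.U, self.V, self.W))
        delta_o = o.copy()
        delta_o[np.arange(T), y] -= 1.0
        for t in reversed(range(T)):
            dLdV += np.outer(delta_o[t], s[t])
            delta_t = self.V.T.dot(delta_o[t]) * (1 - s[t] ** 2)
            for step in range(t, max(0, t - self.bptt_truncate) - 1, -1):
                dLdW += np.outer(delta_t, s[step - 1])
                dLdU[:, x[step]] += delta_t
                delta_t = self.W.T.dot(delta_t) * (1 - s[step - 1] ** 2)
        return dLdU, dLdV, dLdW

def numerical_grad(model, name, x, y, eps=1e-5):
    """Central-difference gradient of model.loss with respect to one parameter matrix."""
    P = getattr(model, name)
    grad = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        old = P[idx]
        P[idx] = old + eps
        loss_plus = model.loss(x, y)
        P[idx] = old - eps
        loss_minus = model.loss(x, y)
        P[idx] = old                      # restore the parameter
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

model = TinyRNN()
x = [0, 3, 1, 4, 2]                       # toy input word indices
y = [3, 1, 4, 2, 5]                       # toy target word indices
analytic = dict(zip("UVW", model.bptt(x, y)))
errors = {k: np.max(np.abs(analytic[k] - numerical_grad(model, k, x, y))) for k in "UVW"}
print(errors)
```

Because bptt_truncate is larger than the sequence length here, no truncation happens and the two gradients should agree closely; with a smaller bptt_truncate the analytic gradients for U and W would only be approximations.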

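3.2.1 Explanation of delta_t

The line dLdW += np.outer(delta_t, s[bptt_step-1]) in the code implements equation (****); the first term and the subsequent terms are handled separately.

The details are as follows.

First term of (****):

$\begin{aligned} \frac{\partial s_{t_m}}{\partial W_{ij}}&=\frac{\partial \tanh(\sum_a U_{ma} x_{t_a}+\sum_b W_{mb} s_{{t-1}_b})}{\partial W_{ij}} \\ &=(1-s_{t_m}^2)\,\delta_{mi}\, s_{{t-1}_j} \end{aligned}$

Here $s_{t-1}$ is held fixed. Combined with (**) and (***), the first term of $\frac{\partial E_t}{\partial W_{ij}}$ is:
$\begin{aligned} [\frac{\partial E_t}{\partial W_{ij}}]_1&=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm} (1-s_{t_m}^2)\,\delta_{mi}\, s_{{t-1}_j}]\} \\ &=(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\} * s_{{t-1}_j} \end{aligned}$
Here $\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$ is the inner product of the $i$-th column of $V$ with $(\hat{y_{t}}-y_{t})$ (the code computes it as the transpose of V times delta_o).

delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2)) therefore implements $(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$, and the remaining factor $s_{{t-1}_j}$ is supplied by np.outer(delta_t, s[bptt_step-1]) with bptt_step = t.

Second term of (****):

First we need to derive:
$\begin{aligned} \frac{\partial s_{t_m}}{\partial s_{{t-1}_n}}&=\frac{\partial \tanh(\sum_a U_{ma} x_{t_a}+\sum_b W_{mb} s_{{t-1}_b})}{\partial s_{{t-1}_n}} \\ &=(1-s_{t_m}^2)\, W_{mn} \end{aligned}$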
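This Jacobian can be verified numerically. The sketch below compares $(1-s_{t_m}^2) W_{mn}$ with central differences of one recurrence step; the sizes and the random values of U, W, x_t and s_prev are assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
U = rng.normal(size=(n, m))
W = rng.normal(size=(n, n))
x_t = np.eye(m)[1]                       # one-hot input
s_prev = np.tanh(rng.normal(size=n))     # stands in for s_{t-1}

def step(s_prev):
    """One recurrence step: s_t = tanh(U x_t + W s_{t-1})."""
    return np.tanh(U @ x_t + W @ s_prev)

s_t = step(s_prev)

# Analytic Jacobian: entry (m, n) is (1 - s_{t_m}^2) * W_{mn}.
analytic = (1 - s_t ** 2)[:, None] * W

# Central differences, perturbing one component of s_{t-1} at a time.
eps = 1e-6
numeric = np.zeros((n, n))
for k in range(n):
    d = np.zeros(n)
    d[k] = eps
    numeric[:, k] = (step(s_prev + d) - step(s_prev - d)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```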
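Then the second term:
$\begin{aligned} \sum_n \frac{\partial s_{t_m}}{\partial s_{{t-1}_n}} \frac{\partial s_{{t-1}_n}}{\partial W_{ij}}&=\sum_n (1-s_{t_m}^2) W_{mn} (1-s_{{t-1}_n}^2)\,\delta_{ni}\, s_{{t-2}_j} \\ &=(1-s_{t_m}^2) W_{mi} (1-s_{{t-1}_i}^2)\, s_{{t-2}_j} \end{aligned}$
Following the same steps as for the first term and combining with (**) and (***), the second term of $\frac{\partial E_t}{\partial W_{ij}}$ is:
$[\frac{\partial E_t}{\partial W_{ij}}]_2=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm} (1-s_{t_m}^2)W_{mi} ]\} * (1-s_{{t-1}_i}^2) * s_{{t-2}_j}$

The factor $s_{{t-2}_j}$ is supplied by dLdW += np.outer(delta_t, s[bptt_step-1]), now with bptt_step = t-1.
Next we explain why the rest is implemented by delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2).

2.1 It is easy to see that $(1-s_{{t-1}_i}^2)$ corresponds to (1 - s[bptt_step-1] ** 2) in the code.

2.2 Why, then, can $\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm} (1-s_{t_m}^2)W_{mi} ]\}$ be obtained by multiplying the previous delta_t by W?

Look at the $i$-th element of the first delta_t: $(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$

The $k$-th element of self.W.T.dot(delta_t) is the dot product of the $k$-th column of W with delta_t, that is,
$\begin{aligned} \sum_{d=1}^n (W_{dk}\, \text{delta}_d)&=\sum_{d=1}^n W_{dk} (1-s_{t_d}^2) \sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{ld}\} \\ &=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_{d=1}^n [V_{ld} (1-s_{t_d}^2) W_{dk}]\} \end{aligned}$
which is exactly the expression in item 2.2 above, with $d$ in place of $m$ and $k$ in place of $i$.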
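The identity between the matrix product and the nested sums can be confirmed with a few lines of NumPy. In the sketch below, diff is a random stand-in for $\hat{y_t}-y_t$, and all sizes and values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3
V = rng.normal(size=(m, n))
W = rng.normal(size=(n, n))
s_t = np.tanh(rng.normal(size=n))
diff = rng.normal(size=m)               # stand-in for y_hat_t - y_t

# Vectorized form used in the code.
delta_t = V.T.dot(diff) * (1 - s_t ** 2)
vectorized = W.T.dot(delta_t)

# Explicit form: sum_l diff_l * sum_mm V[l, mm] * (1 - s_t[mm]^2) * W[mm, i]
explicit = np.zeros(n)
for i in range(n):
    for l in range(m):
        for mm in range(n):
            explicit[i] += diff[l] * V[l, mm] * (1 - s_t[mm] ** 2) * W[mm, i]

print(np.allclose(vectorized, explicit))   # True
```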

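Third term of (****):
$[\frac{\partial E_t}{\partial W_{ij}}]_3=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm} (1-s_{t_m}^2) \sum_n (W_{mn} (1-s_{{t-1}_n}^2) W_{ni})]\} * (1-s_{{t-2}_i}^2) * s_{{t-3}_j}$
It can likewise be obtained by multiplying the previous step's delta by W; the proof is analogous.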

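3.3 Derivative with respect to U

This is very similar to W.

The equations:
$\begin{aligned} E_t&=-\sum_k (y_{t_k} \log \hat{y_{t_k}}) \\ \hat{y_t}&=\mathrm{softmax}(q_t) \\ q_t&=V s_t \\ s_t&=\tanh(U x_t+W s_{t-1}) \end{aligned}$
As with $V_{ij}$, differentiating with respect to $U_{ij}$:
$\frac{\partial E_t}{\partial U_{ij}}=\sum_k \sum_l \sum_m(\frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial s_{t_m}} \frac{\partial s_{t_m}}{\partial U_{ij}} ) \quad (*)$
We only need to look at the fourth factor. Its first term (with $s_{t-1}$ held fixed) is:
$\begin{aligned} \frac{\partial s_{t_m}}{\partial U_{ij}}&=\frac{\partial \tanh(\sum_a U_{ma} x_{t_a}+\sum_b W_{mb} s_{{t-1}_b})}{\partial U_{ij}} \\ &=(1-s_{t_m}^2)\,\delta_{mi}\, x_{t_j} \end{aligned}$
This is essentially the same as the first term of $\frac{\partial s_{t_m}}{\partial W_{ij}}$, except that the last factor is $x_{t_j}$ instead of $s_{{t-1}_j}$.

So $\frac{\partial E_t}{\partial U_{ij}}$ is:
$\frac{\partial E_t}{\partial U_{ij}}=(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\} * x_{t_j}+\cdots$
where $\cdots$ stands for the contributions from earlier time steps, which the inner loop accumulates in the same way as for $W$.
delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2)) implements $(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$.

dLdU[:,x[bptt_step]] += delta_t implements the factor $x_{t_j}$: because $x_t$ is one-hot (its entries are only 0 or 1), it is enough to add delta_t to the column of dLdU where $x_t$ is nonzero.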
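The column shortcut can be checked against the full outer product. In the sketch below, delta_t is a random stand-in and word is an arbitrary word index; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3
delta_t = rng.normal(size=n)      # stand-in for the backpropagated delta
word = 2                          # index of the active word at this step
x_t = np.eye(m)[word]             # one-hot input vector

# Full form: entry (i, j) is delta_i * x_{t_j}, shape (n, m) like U.
full = np.outer(delta_t, x_t)

# Shortcut used in the code: only the column of the active word is touched.
shortcut = np.zeros((n, m))
shortcut[:, word] += delta_t

print(np.allclose(full, shortcut))   # True
```

The shortcut avoids building the mostly zero outer product, which matters when the vocabulary size m is large.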
