Andrew Ng Deep Learning_5_Week1 Recurrent Sequence Models: Recurrent Neural Networks

Latest recommended article published on 2024-09-12 23:40:25

爪娃侠

Category: Andrew Ng Deep Learning. Tags: deep learning, neural networks, artificial intelligence

Article link: https://blog.csdn.net/zxy0000zxy/article/details/134385410


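Building a Recurrent Neural Network Step by Step

1. Forward propagation for the basic recurrent neural network
2. Long Short-Term Memory (LSTM) network
3. Backpropagation in recurrent neural networks

Course 5: Sequence Models
Week 1: Recurrent Sequence Models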

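Recurrent neural networks (RNNs) are very effective for natural language processing and other sequence tasks because they have "memory". They read the inputs x⟨t⟩ (such as words) one at a time and carry some information/context from one time step to the next through the hidden-layer activations. This lets a unidirectional RNN use information from the past to process later inputs. A bidirectional RNN can draw context from both the past and the future.
Notation:
1. Superscript [l] denotes an object associated with the l-th layer.
Example: a[4] is the activation of layer 4. W[5] and b[5] are the parameters of layer 5.
2. Superscript (i) denotes an object associated with the i-th example.
Example: x(i) is the input of the i-th training example.
3. Superscript ⟨t⟩ denotes an object at the t-th time step.
Example: x⟨t⟩ is the input x at time step t. x(i)⟨t⟩ is the input of the i-th example at time step t.
4. Subscript i denotes the i-th entry of a vector.
Example: a[l]i is the i-th entry of the activations in layer l.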
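As a rough illustration of how this notation maps onto the arrays used later, here is a minimal sketch. The sizes are hypothetical, and the array layout (n_x, m, T_x) is the one the code in this post uses. Note that the math counts time steps from 1 while NumPy indexes from 0.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_x, m, T_x = 3, 10, 4            # input features, batch size, time steps
x = np.arange(n_x * m * T_x, dtype=float).reshape(n_x, m, T_x)

x_t = x[:, :, 2]                  # x<t>: every example at one time step (0-based t = 2)
x_i_t = x[:, 5, 2]                # x(i)<t>: example i = 5 at that time step
x_i_t_k = x[1, 5, 2]              # entry k = 1 of that vector

print(x_t.shape)                  # (3, 10)
print(x_i_t.shape)                # (3,)
```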

import numpy as np
from rnn_utils import *
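The rnn_utils module comes with the course notebook and is not reproduced in this post. The code below only relies on two helpers from it, softmax and sigmoid. A minimal stand-in, assuming softmax is applied column-wise (one column per example), could look like this:

```python
import numpy as np

def softmax(x):
    # Column-wise softmax: each column (one example) sums to 1.
    # Subtracting the column maximum avoids overflow in exp.
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0, keepdims=True)

def sigmoid(x):
    # Element-wise logistic function, with outputs in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))
```

With these two definitions in scope, the remaining code in this post can be run without the course files.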

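I. Forward propagation for the basic recurrent neural network

You will use an RNN to generate music. The basic RNN you will implement has the structure shown below. In this example, Tx = Ty.
[Figure 1: structure of the basic RNN]
Steps:
1. Implement the calculations needed for one time step of the RNN.
2. Implement a loop over Tx time steps in order to process all the inputs, one at a time.

1. RNN cell

A recurrent neural network can be seen as the repetition of a single cell. You will first implement the computations for a single time step. The figure below describes the operations of a single time step of an RNN cell.
[Figure 2: a single time step of the RNN cell]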
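Since the figure is not reproduced, the single-step computation can be written out explicitly. The two equations below are read off the rnn_cell_forward code that follows, not copied from the original figure, so the symbol names simply mirror the code's parameter names.

```latex
\begin{aligned}
a^{\langle t \rangle} &= \tanh\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right) \\
\hat{y}^{\langle t \rangle} &= \mathrm{softmax}\left(W_{ya}\, a^{\langle t \rangle} + b_y\right)
\end{aligned}
```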

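Implement a single forward step of the RNN cell as described in Figure (2).
Arguments:
xt -- input data at time step "t", numpy array of shape (n_x, m).
a_prev -- hidden state at time step "t-1", numpy array of shape (n_a, m).
parameters -- Python dictionary containing:
                    Wax -- weight matrix multiplying the input, numpy array of shape (n_a, n_x).
                    Waa -- weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a).
                    Wya -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a).
                    ba -- bias, numpy array of shape (n_a, 1).
                    by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1).
Returns:
a_next -- next hidden state, numpy array of shape (n_a, m).
yt_pred -- prediction at time step "t", numpy array of shape (n_y, m).
cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters).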

# GRADED FUNCTION: rnn_cell_forward
def rnn_cell_forward(xt, a_prev, parameters): 
    
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    
    # compute next activation state using the formula given above
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)

    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya, a_next) + by) 
    
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache
    
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
print("a_next[4] = ", a_next[4])
print("a_next.shape = ", a_next.shape)
print("yt_pred[1] =", yt_pred[1])
print("yt_pred.shape = ", yt_pred.shape)


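2. RNN forward pass

You can think of an RNN as the repetition of the cell you just built. If your input sequence spans 10 time steps, you copy the RNN cell 10 times. Each cell takes as input the hidden state of the previous cell (a⟨t−1⟩) and the input data of the current time step (x⟨t⟩). It outputs a hidden state (a⟨t⟩) and a prediction (y⟨t⟩) for this time step.
Figure 3: Basic RNN. The input sequence x = (x⟨1⟩, x⟨2⟩, …, x⟨Tx⟩) is carried over Tx time steps. The network outputs y = (y⟨1⟩, y⟨2⟩, …, y⟨Tx⟩).

Exercise: code the forward propagation of the RNN described in Figure (3).
Instructions:
1. Create an array of zeros (a) that will store all the hidden states computed by the RNN.
2. Initialize the "next" hidden state as a0 (the initial hidden state).
3. Loop over each time step, with incremental index t:
    Update the "next" hidden state and the cache by running rnn_cell_forward.
    Store the "next" hidden state in a, at position t.
    Store the prediction in y.
    Add the cache to the list of caches.
4. Return a, y and caches.

Implement the forward propagation of the recurrent neural network described in Figure (3).
Arguments:
x -- input data for every time step, array of shape (n_x, m, T_x).
a0 -- initial hidden state, array of shape (n_a, m).
parameters -- Python dictionary containing:
                    Waa -- weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a).
                    Wax -- weight matrix multiplying the input, numpy array of shape (n_a, n_x).
                    Wya -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a).
                    ba -- bias, numpy array of shape (n_a, 1).
                    by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1).
Returns:
a -- hidden states for every time step, numpy array of shape (n_a, m, T_x).
y_pred -- predictions for every time step, numpy array of shape (n_y, m, T_x).
caches -- tuple of values needed for the backward pass, contains (list of caches, x).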

# GRADED FUNCTION: rnn_forward
def rnn_forward(x, a0, parameters):      
    # Initialize "caches" which will contain the list of all caches
    caches = []
    
    # Retrieve dimensions from shapes of x and parameters["Wya"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape
    
    # initialize "a" and "y" with zeros (≈2 lines)
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))

    # Initialize a_next (≈1 line)
    a_next = a0

    # loop over all time-steps
    for t in range(T_x):

        # Update next hidden state, compute the prediction, get the cache (≈1 line)
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)

        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next

        # Save the value of the prediction in y (≈1 line)
        y_pred[:,:,t] = yt_pred

        # Append "cache" to "caches" (≈1 line)
        caches.append(cache)

    # store values needed for backward propagation in cache
    caches = (caches, x)
    
    return a, y_pred, caches
    
np.random.seed(1)
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a, y_pred, caches = rnn_forward(x, a0, parameters)
print("a[4][1] = ", a[4][1])
print("a.shape = ", a.shape)
print("y_pred[1][3] =", y_pred[1][3])
print("y_pred.shape = ", y_pred.shape)
print("caches[1][1][3] =", caches[1][1][3])
print("len(caches) = ", len(caches))

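You have now built the forward propagation of a recurrent neural network from scratch. This is good enough for some applications, but it suffers from vanishing gradients. It therefore works best when each output y⟨t⟩ can be estimated mainly from "local" context, that is, information from inputs x⟨t′⟩ where t′ is not too far from t.
In the next part, you will build a more complex LSTM model, which is better at addressing vanishing gradients. The LSTM can better remember a piece of information and keep it over many time steps.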
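To make the vanishing-gradient remark concrete, here is a small sketch, not part of the course code. It runs a tanh recurrence with no input and then pushes a gradient backward through time with the same rule that rnn_cell_backward uses for da_prev. The sizes and the recurrent matrix (0.5 times the identity) are assumptions picked so that the shrinkage is easy to see.

```python
import numpy as np

np.random.seed(0)
n_a, T_x = 5, 30                       # hypothetical sizes for illustration
Waa = 0.5 * np.eye(n_a)                # assumption: a small recurrent weight matrix
a = np.random.randn(n_a, 1)

# Forward pass with no input, keeping the hidden states for the backward pass.
states = []
for t in range(T_x):
    a = np.tanh(Waa @ a)
    states.append(a)

# Backward pass: send a gradient from the last time step back toward the first.
da = np.ones((n_a, 1))
norms = []
for a_t in reversed(states):
    da = Waa.T @ ((1 - a_t ** 2) * da)     # da_prev = Waa^T (1 - a^2) * da_next
    norms.append(np.linalg.norm(da))

print(norms[0], norms[-1])             # the norm shrinks at every step
```

Each backward step multiplies the gradient by a factor of at most 0.5 here, so after 30 steps almost nothing of the original signal is left. Inputs far in the past therefore barely influence the parameter update.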

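II. Long Short-Term Memory (LSTM) network

The figure below shows the operations of the LSTM cell.
[Figure 4: LSTM cell]
As in the RNN example above, you will first implement the LSTM cell for a single time step. You can then call it iteratively inside a for loop so that it processes an input with Tx time steps.

About the gates

Forget gate

For the sake of illustration, assume we are reading the words of a piece of text and want to use an LSTM to keep track of grammatical structure, such as whether the subject is singular or plural. If the subject changes from singular to plural, we need a way to get rid of the previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this:
Γf⟨t⟩ = σ(Wf [a⟨t−1⟩, x⟨t⟩] + bf)    (1)
Here, Wf are the weights that govern the forget gate's behavior. We concatenate [a⟨t−1⟩, x⟨t⟩] and multiply by Wf. The result of the equation above is a vector Γf⟨t⟩ with values between 0 and 1. This forget gate vector is multiplied element-wise with the previous cell state c⟨t−1⟩. So if one of the values of Γf⟨t⟩ is 0 (or close to 0), the LSTM should remove that piece of information (for example, the singular subject) from the corresponding component of c⟨t−1⟩. If one of the values is 1, the information is kept.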
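The following sketch illustrates the forget gate computation on small arrays. The sizes and random values are hypothetical, and sigmoid is redefined locally so the block runs on its own. It also shows that multiplying Wf by the concatenation is the same as splitting Wf into a block acting on a⟨t−1⟩ and a block acting on x⟨t⟩, which is the view the backward pass later relies on when it slices the weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n_a, n_x, m = 4, 3, 2                      # hypothetical sizes
a_prev = np.random.randn(n_a, m)
xt = np.random.randn(n_x, m)
Wf = np.random.randn(n_a, n_a + n_x)
bf = np.random.randn(n_a, 1)

# Stack a_prev on top of xt, then apply a single weight matrix.
concat = np.concatenate((a_prev, xt), axis=0)      # shape (n_a + n_x, m)
gamma_f = sigmoid(Wf @ concat + bf)                # every entry lies in (0, 1)

# Equivalent view: the first n_a columns of Wf act on a_prev, the rest on xt.
split = sigmoid(Wf[:, :n_a] @ a_prev + Wf[:, n_a:] @ xt + bf)

# Hand-picked gate values show the effect on the cell state:
# a gate of 0 erases the entry, 1 keeps it, 0.5 halves it.
c_prev = np.array([[2.0], [-3.0], [4.0]])
gate = np.array([[0.0], [1.0], [0.5]])
kept = gate * c_prev
print(kept.ravel())
```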

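Update gate

Once we have forgotten that the subject being discussed is singular, we need a way to update the state to reflect that the new subject is now plural. Here is the formula for the update gate (the code calls this gate it, with weights Wi and bias bi):
Γu⟨t⟩ = σ(Wi [a⟨t−1⟩, x⟨t⟩] + bi)    (2)
As with the forget gate, Γu⟨t⟩ is again a vector with values between 0 and 1. It is multiplied element-wise with c̃⟨t⟩ in order to compute c⟨t⟩.

Updating the cell state

To update the state for the new subject, we need to create a new vector of numbers that can be added to the previous cell state. The equation we use is:
c̃⟨t⟩ = tanh(Wc [a⟨t−1⟩, x⟨t⟩] + bc)    (3)
Finally, the new cell state is (∗ denotes element-wise multiplication):
c⟨t⟩ = Γf⟨t⟩ ∗ c⟨t−1⟩ + Γu⟨t⟩ ∗ c̃⟨t⟩    (4)

Output gate

To decide which outputs to use, we use the following two formulas:
Γo⟨t⟩ = σ(Wo [a⟨t−1⟩, x⟨t⟩] + bo)    (5)
a⟨t⟩ = Γo⟨t⟩ ∗ tanh(c⟨t⟩)    (6)
In equation (5) you use the sigmoid function to decide what to output, and in equation (6) you multiply it by the tanh of the updated cell state c⟨t⟩.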

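1. LSTM cell

Implement a single forward step of the LSTM cell as described in Figure (4):
Arguments:
    xt -- input data at time step "t", numpy array of shape (n_x, m).
    a_prev -- hidden state at time step "t-1", numpy array of shape (n_a, m)
    c_prev -- memory state at time step "t-1", numpy array of shape (n_a, m)
    parameters -- Python dictionary containing:
                        Wf -- weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
                        by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)
Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at time step "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)
Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde), and c stands for the memory value.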

# GRADED FUNCTION: lstm_cell_forward
def lstm_cell_forward(xt, a_prev, c_prev, parameters):

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    # Concatenate a_prev and xt (≈3 lines)
    concat = np.zeros((n_x + n_a, m))
    concat[: n_a, :] = a_prev  
    concat[n_a :, :] = xt 

    # Compute values for ft, it, cct, c_next, ot, a_next using the formulas given figure (4) 
    ft = sigmoid(np.dot(Wf, concat) + bf)
    it = sigmoid(np.dot(Wi, concat) + bi)
    cct = np.tanh(np.dot(Wc, concat) + bc)
    c_next = ft*c_prev + it*cct
    ot = sigmoid(np.dot(Wo, concat) + bo)
    a_next = ot*np.tanh(c_next)

    # Compute prediction of the LSTM cell (≈1 line)
    yt_pred = softmax(np.dot(Wy, a_next) + by)

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
c_prev = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)
print("a_next[4] = ", a_next[4])
print("a_next.shape = ", a_next.shape)
print("c_next[2] = ", c_next[2])
print("c_next.shape = ", c_next.shape)
print("yt[1] =", yt[1])
print("yt.shape = ", yt.shape)
print("cache[1][3] =", cache[1][3])
print("len(cache) = ", len(cache))

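2. LSTM forward pass

Now that you have implemented one step of an LSTM, you can iterate it with a for loop to process a sequence of Tx inputs.
[Figure: LSTM over Tx time steps]

Implement the forward propagation of the recurrent neural network using the LSTM cell described in Figure (4).
Arguments:
    x -- input data for every time step, of shape (n_x, m, T_x).
    a0 -- initial hidden state, of shape (n_a, m)
    parameters -- Python dictionary containing:
                        Wf -- weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
                        by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)
Returns:
    a -- hidden states for every time step, numpy array of shape (n_a, m, T_x)
    y -- predictions for every time step, numpy array of shape (n_y, m, T_x)
    c -- memory states for every time step, numpy array of shape (n_a, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)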

# GRADED FUNCTION: lstm_forward
def lstm_forward(x, a0, parameters):  
    # Initialize "caches", which will track the list of all the caches
    caches = []
    # Retrieve dimensions from shapes of x and parameters['Wy'] (≈2 lines)
    n_x, m, T_x = x.shape
    n_y, n_a = parameters['Wy'].shape
    # Do not use Wy.shape here: a bare Wy would silently pick up the global variable rather than a local one
    # initialize "a", "c" and "y" with zeros (≈3 lines)
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))

    # Initialize a_next and c_next (≈2 lines)
    a_next = a0
    c_next = np.zeros((n_a, m))

    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, next memory state, compute the prediction, get the cache
        a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)

        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next

        # Save the value of the prediction in y (≈1 line)
        y[:,:,t] = yt

        # Save the value of the next cell state (≈1 line)
        c[:,:,t]  = c_next

        # Append the cache into caches (≈1 line)
        caches.append(cache)

    # store values needed for backward propagation in cache
    caches = (caches, x)

    return a, y, c, caches

np.random.seed(1)
x = np.random.randn(3,10,7)
a0 = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a, y, c, caches = lstm_forward(x, a0, parameters)
print("a[4][3][6] = ", a[4][3][6])
print("a.shape = ", a.shape)
print("y[1][4][3] =", y[1][4][3])
print("y.shape = ", y.shape)
print("caches[1][1][1] =", caches[1][1][1])
print("c[1][2][1] =", c[1][2][1])
print("len(caches) = ", len(caches))

c[1][2][1] = -0.855544916718
len(caches) =2
a[4][3][6] = 0.172117767533
a.shape = (5, 10, 7)
y[1][4][3] = 0.95087346185
y.shape = (2, 10, 7)
caches[1][1][1] = [ 0.82797464 0.23009474 0.76201118 -0.22232814 -0.20075807 0.18656139 0.41005165]

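You have now implemented the forward passes for the basic RNN and the LSTM. When using a deep learning framework, implementing the forward pass is sufficient to build systems that achieve great performance.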

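III. Backpropagation in recurrent neural networks

In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass, so most deep learning engineers do not need to bother with its details. If, however, you are comfortable with calculus and want to see the details of backprop in RNNs, you can work through this optional portion of the notebook.
In an earlier course, when you implemented a simple (fully connected) neural network, you used backpropagation to compute the derivatives of the cost with respect to the parameters in order to update them. Similarly, in recurrent neural networks you can compute the derivatives of the cost with respect to the parameters in order to update them. The backprop equations are quite complicated and were not derived in lecture. They are, however, presented briefly below.

1. Basic RNN backward pass

We will start by computing the backward pass for the basic RNN cell.
[Figure: backward pass of the RNN cell, with its equations]
Deriving the one-step backward function:
1. To compute rnn_cell_backward you need to compute the equations shown in the figure. Deriving them by hand is a good exercise.
2. The derivative of tanh is 1 − tanh(x)². Note: sech(x)² = 1 − tanh(x)².
3. Similarly, for ∂a⟨t⟩/∂Wax, ∂a⟨t⟩/∂Waa and ∂a⟨t⟩/∂b, the derivative of tanh(u) is (1 − tanh(u)²) du.
The last two equations follow the same rule and are derived using the tanh derivative. Note that the terms are arranged so that the dimensions match.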
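A quick numerical check of the tanh derivative used throughout the backward pass is sketched below. It compares the closed form 1 − tanh(u)² with a central finite difference. The sample points and step size are arbitrary choices.

```python
import numpy as np

u = np.linspace(-3.0, 3.0, 13)
eps = 1e-6

# Central finite-difference approximation of d tanh(u) / du.
numeric = (np.tanh(u + eps) - np.tanh(u - eps)) / (2 * eps)
analytic = 1 - np.tanh(u) ** 2          # the closed form used in the backward pass

max_error = np.max(np.abs(numeric - analytic))
print(max_error)                        # only finite-difference error remains
```

In rnn_cell_backward the same expression appears as (1 - a_next**2), because a_next already holds tanh(u).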

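Implement the backward pass for the RNN cell (single time step).
Arguments:
    da_next -- gradient of the loss with respect to the next hidden state
    cache -- tuple of useful values (output of rnn_cell_forward())
Returns:
    gradients -- Python dictionary containing:
                        dxt -- gradients of the input data, of shape (n_x, m)
                        da_prev -- gradients of the previous hidden state, of shape (n_a, m)
                        dWax -- gradients of the input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- gradients of the hidden-to-hidden weights, of shape (n_a, n_a)
                        dba -- gradients of the bias vector, of shape (n_a, 1)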

def rnn_cell_backward(da_next, cache):
 
    # Retrieve values from cache
    (a_next, a_prev, xt, parameters) = cache
    
    # Retrieve values from parameters
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # compute the gradient of tanh with respect to a_next (≈1 line)
    # da_next is the gradient of the loss with respect to the next hidden state (see the arguments above).
    # Using d tanh(u) = (1 - tanh(u)**2) * du, with tanh(u) = a_next and da_next as the upstream gradient:
    dtanh = (1-a_next**2)*da_next    # formula 1、2

    # compute the gradient of the loss with respect to Wax (≈2 lines)
    dxt = np.dot(Wax.T, dtanh)    # formula 6
    dWax = np.dot(dtanh, xt.T)    # formula 3

    # compute the gradient with respect to Waa (≈2 lines)
    da_prev = np.dot(Waa.T, dtanh)    # formula 7
    dWaa = np.dot(dtanh, a_prev.T)    # formula 4

    # compute the gradient with respect to b (≈1 line)
    dba = np.sum(dtanh, keepdims=True, axis=-1)    # formula 5

    # Store the gradients in a python dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    
    return gradients
    
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Wax = np.random.randn(5,3)
Waa = np.random.randn(5,5)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}

a_next, yt, cache = rnn_cell_forward(xt, a_prev, parameters)

da_next = np.random.randn(5,10)
gradients = rnn_cell_backward(da_next, cache)
print("gradients[\"dxt\"][1][2] =", gradients["dxt"][1][2])
print("gradients[\"dxt\"].shape =", gradients["dxt"].shape)
print("gradients[\"da_prev\"][2][3] =", gradients["da_prev"][2][3])
print("gradients[\"da_prev\"].shape =", gradients["da_prev"].shape)
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWax\"].shape =", gradients["dWax"].shape)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWaa\"].shape =", gradients["dWaa"].shape)
print("gradients[\"dba\"][4] =", gradients["dba"][4])
print("gradients[\"dba\"].shape =", gradients["dba"].shape)

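Backward pass through the RNN

Computing the gradients of the cost with respect to a⟨t⟩ at every time step t is useful because it is what lets the gradient backpropagate to the previous RNN cell. To do so, you iterate through all the time steps starting from the end, and at each step you increment the overall dba, dWaa and dWax, and you store dx.
Instructions:
Implement the rnn_backward function. Initialize the return variables with zeros first, then loop through all the time steps, calling rnn_cell_backward at each time step and updating the other variables accordingly.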
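The accumulation described above can be sanity-checked on a toy problem. The sketch below is an illustration under stated assumptions, not the notebook's code. It uses a scalar RNN a_t = tanh(w·a_{t−1} + u·x_t), a made-up loss equal to the sum of all hidden states, and hypothetical helper names (forward, loss, bptt_dw). It accumulates dw backward through time and compares the result with a finite-difference estimate.

```python
import numpy as np

def forward(w, u, x, a0):
    # Scalar RNN: a_t = tanh(w * a_{t-1} + u * x_t); returns all hidden states.
    a, states = a0, []
    for x_t in x:
        a = np.tanh(w * a + u * x_t)
        states.append(a)
    return states

def loss(w, u, x, a0):
    # Toy loss (an assumption for this sketch): the sum of all hidden states.
    return sum(forward(w, u, x, a0))

def bptt_dw(w, u, x, a0):
    states = forward(w, u, x, a0)
    dw, da_prev = 0.0, 0.0
    for t in reversed(range(len(x))):
        a_prev = states[t - 1] if t > 0 else a0
        da = 1.0 + da_prev                 # loss gradient at step t plus gradient flowing back from step t+1
        dtanh = (1 - states[t] ** 2) * da
        dw += dtanh * a_prev               # accumulate the shared weight's gradient
        da_prev = w * dtanh                # hand the gradient to the previous step
    return dw

x = np.array([0.5, -1.0, 0.3, 0.8])
w, u, a0, eps = 0.7, -0.4, 0.1, 1e-6
numeric = (loss(w + eps, u, x, a0) - loss(w - eps, u, x, a0)) / (2 * eps)
analytic = bptt_dw(w, u, x, a0)
print(analytic, numeric)
```

The structure mirrors rnn_backward: the gradient passed to each step is the upstream gradient at that step plus the gradient carried back from the following step, and the parameter gradient is summed over all time steps because the weight is shared.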

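Implement the backward pass for an RNN over an entire sequence of input data.
Arguments:
    da -- upstream gradients of all hidden states, of shape (n_a, m, T_x)
    caches -- tuple containing information from the forward pass (rnn_forward)
Returns:
    gradients -- Python dictionary containing:
                        dx -- gradient w.r.t. the input data, numpy array of shape (n_x, m, T_x)
                        da0 -- gradient w.r.t. the initial hidden state, numpy array of shape (n_a, m)
                        dWax -- gradient w.r.t. the input's weight matrix, numpy array of shape (n_a, n_x)
                        dWaa -- gradient w.r.t. the hidden state's weight matrix, numpy array of shape (n_a, n_a)
                        dba -- gradient w.r.t. the bias, of shape (n_a, 1)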

def rnn_backward(da, caches):
  
    # Retrieve values from the first cache (t=1) of caches (≈2 lines)
    (caches, x) = caches
    (a1, a0, x1, parameters) = caches[0]

    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = da.shape
    n_x, m = x1.shape 

    # initialize the gradients with the right sizes (≈6 lines)
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))

    # Loop through all the time steps
    for t in reversed(range(T_x)):
        # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. 
        gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t])
        # Retrieve derivatives from gradients 
        dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]

        # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
        dx[:, :, t] = dxt
        dWax += dWaxt
        dWaa += dWaat
        dba += dbat
              
    # Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line) 
    da0 = da_prevt

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}
    
    return gradients
    
np.random.seed(1)
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Wax = np.random.randn(5,3)
Waa = np.random.randn(5,5)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}
a, y, caches = rnn_forward(x, a0, parameters)
da = np.random.randn(5, 10, 4)
gradients = rnn_backward(da, caches)

print("gradients[\"dx\"][1][2] =", gradients["dx"][1][2])
print("gradients[\"dx\"].shape =", gradients["dx"].shape)
print("gradients[\"da0\"][2][3] =", gradients["da0"][2][3])
print("gradients[\"da0\"].shape =", gradients["da0"].shape)
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWax\"].shape =", gradients["dWax"].shape)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWaa\"].shape =", gradients["dWaa"].shape)
print("gradients[\"dba\"][4] =", gradients["dba"][4])
print("gradients[\"dba\"].shape =", gradients["dba"].shape)

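2. LSTM backward pass

1) One step backward

The LSTM backward pass is considerably more complicated than the forward pass. All the equations for the LSTM backward pass are provided below. (If you enjoy calculus exercises, feel free to try deriving them from scratch yourself.)

2) Gate derivatives

Equations (7) to (10), written with the names used in the code (Γo = ot, Γu = it, Γf = ft, c̃ = cct, ∗ is element-wise multiplication):
dΓo = da_next ∗ tanh(c_next) ∗ Γo ∗ (1 − Γo)
dc̃ = (dc_next ∗ Γu + Γo ∗ (1 − tanh(c_next)²) ∗ Γu ∗ da_next) ∗ (1 − c̃²)
dΓu = (dc_next ∗ c̃ + Γo ∗ (1 − tanh(c_next)²) ∗ c̃ ∗ da_next) ∗ Γu ∗ (1 − Γu)
dΓf = (dc_next ∗ c_prev + Γo ∗ (1 − tanh(c_next)²) ∗ c_prev ∗ da_next) ∗ Γf ∗ (1 − Γf)

3) Parameter derivatives

Equations (11) to (14), where [a_prev; xt] is the concatenation of a_prev and xt:
dWf = dΓf [a_prev; xt]ᵀ
dWi = dΓu [a_prev; xt]ᵀ
dWc = dc̃ [a_prev; xt]ᵀ
dWo = dΓo [a_prev; xt]ᵀ
The bias gradients dbf, dbi, dbc and dbo are obtained by summing dΓf, dΓu, dc̃ and dΓo over the batch axis (axis=1, keepdims=True).
Equations (15) to (17), where W[:, :n_a] denotes the first n_a columns of a weight matrix and W[:, n_a:] the remaining n_x columns:
da_prev = Wf[:, :n_a]ᵀ dΓf + Wi[:, :n_a]ᵀ dΓu + Wc[:, :n_a]ᵀ dc̃ + Wo[:, :n_a]ᵀ dΓo
dc_prev = dc_next ∗ Γf + Γo ∗ (1 − tanh(c_next)²) ∗ Γf ∗ da_next
dxt = Wf[:, n_a:]ᵀ dΓf + Wi[:, n_a:]ᵀ dΓu + Wc[:, n_a:]ᵀ dc̃ + Wo[:, n_a:]ᵀ dΓo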

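Implement the backward pass for the LSTM cell (single time step).
Arguments:
    da_next -- gradient of the next hidden state, of shape (n_a, m)
    dc_next -- gradient of the next cell state, of shape (n_a, m)
    cache -- cache storing information from the forward pass
Returns:
    gradients -- Python dictionary containing:
                     dxt -- gradient of the input data at time step t, of shape (n_x, m)
                     da_prev -- gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                     dc_prev -- gradient w.r.t. the previous memory state, of shape (n_a, m)
                     dWf -- gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                     dWi -- gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                     dWc -- gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                     dWo -- gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                     dbf -- gradient w.r.t. the bias of the forget gate, of shape (n_a, 1)
                     dbi -- gradient w.r.t. the bias of the update gate, of shape (n_a, 1)
                     dbc -- gradient w.r.t. the bias of the memory gate, of shape (n_a, 1)
                     dbo -- gradient w.r.t. the bias of the output gate, of shape (n_a, 1)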

def lstm_cell_backward(da_next, dc_next, cache):
    # Retrieve information from "cache"
    (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache

    # Retrieve dimensions from xt's and a_next's shape (≈2 lines)
    n_x, m = xt.shape
    n_a, m = a_next.shape

    # Compute the gate-related derivatives, following equations (7) to (10) (≈4 lines)
    dot = da_next * np.tanh(c_next) * ot * (1-ot) 
    dcct = (dc_next*it+ot*(1-np.square(np.tanh(c_next)))*it*da_next)*(1-np.square(cct))
    dit = (dc_next*cct+ot*(1-np.square(np.tanh(c_next)))*cct*da_next)*it*(1-it)
    dft = (dc_next*c_prev+ot*(1-np.square(np.tanh(c_next)))*c_prev*da_next)*ft*(1-ft) 

    # Compute parameters related derivatives. Use equations (11)-(14) (≈8 lines)
    dWf = np.dot(dft, np.concatenate((a_prev, xt), axis=0).T)
    dWi = np.dot(dit, np.concatenate((a_prev, xt), axis=0).T)
    dWc = np.dot(dcct, np.concatenate((a_prev, xt), axis=0).T)
    dWo = np.dot(dot, np.concatenate((a_prev, xt), axis=0).T)
    dbf = np.sum(dft, axis=1, keepdims=True)
    dbi = np.sum(dit, axis=1, keepdims=True)
    dbc = np.sum(dcct, axis=1, keepdims=True)
    dbo = np.sum(dot, axis=1, keepdims=True)

    # Compute derivatives w.r.t previous hidden state, previous memory state and input. Use equations (15)-(17). (≈3 lines)
    da_prev = np.dot(parameters['Wf'][:,:n_a].T, dft) + np.dot(parameters['Wi'][:,:n_a].T, dit) + np.dot(parameters['Wc'][:,:n_a].T, dcct) + np.dot(parameters['Wo'][:,:n_a].T, dot)
    dc_prev = dc_next*ft + ot*(1-np.square(np.tanh(c_next)))*ft*da_next
    dxt = np.dot(parameters['Wf'][:,n_a:].T,dft)+np.dot(parameters['Wi'][:,n_a:].T,dit)+np.dot(parameters['Wc'][:,n_a:].T,dcct)+np.dot(parameters['Wo'][:,n_a:].T,dot) 

    # Save gradients in dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}

    return gradients
    
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
c_prev = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)

da_next = np.random.randn(5,10)
dc_next = np.random.randn(5,10)
gradients = lstm_cell_backward(da_next, dc_next, cache)
print("gradients[\"dxt\"][1][2] =", gradients["dxt"][1][2])
print("gradients[\"dxt\"].shape =", gradients["dxt"].shape)
print("gradients[\"da_prev\"][2][3] =", gradients["da_prev"][2][3])
print("gradients[\"da_prev\"].shape =", gradients["da_prev"].shape)
print("gradients[\"dc_prev\"][2][3] =", gradients["dc_prev"][2][3])
print("gradients[\"dc_prev\"].shape =", gradients["dc_prev"].shape)
print("gradients[\"dWf\"][3][1] =", gradients["dWf"][3][1])
print("gradients[\"dWf\"].shape =", gradients["dWf"].shape)
print("gradients[\"dWi\"][1][2] =", gradients["dWi"][1][2])
print("gradients[\"dWi\"].shape =", gradients["dWi"].shape)
print("gradients[\"dWc\"][3][1] =", gradients["dWc"][3][1])
print("gradients[\"dWc\"].shape =", gradients["dWc"].shape)
print("gradients[\"dWo\"][1][2] =", gradients["dWo"][1][2])
print("gradients[\"dWo\"].shape =", gradients["dWo"].shape)
print("gradients[\"dbf\"][4] =", gradients["dbf"][4])
print("gradients[\"dbf\"].shape =", gradients["dbf"].shape)
print("gradients[\"dbi\"][4] =", gradients["dbi"][4])
print("gradients[\"dbi\"].shape =", gradients["dbi"].shape)
print("gradients[\"dbc\"][4] =", gradients["dbc"][4])
print("gradients[\"dbc\"].shape =", gradients["dbc"].shape)
print("gradients[\"dbo\"][4] =", gradients["dbo"][4])
print("gradients[\"dbo\"].shape =", gradients["dbo"].shape)

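3. Backward pass through the LSTM RNN

This part is very similar to the rnn_backward function you implemented above. You first create variables with the same dimensions as the return variables. You then iterate over all the time steps starting from the end, calling the one-step function you implemented for the LSTM at each iteration. You then update the parameter gradients by summing them individually. Finally, you return a dictionary with the new gradients.
Instructions: Implement the lstm_backward function. Create a for loop that starts from Tx and goes backward. At each step, call lstm_cell_backward and update the old gradients by adding the new gradients to them. Note that dxt is not updated but stored.

Implement the backward pass for the RNN with LSTM cells (over a whole sequence).
Arguments:
    da -- gradients w.r.t. the hidden states, numpy array of shape (n_a, m, T_x)
    dc -- gradients w.r.t. the memory states, numpy array of shape (n_a, m, T_x) (listed here, but the function below takes only da and caches)
    caches -- cache storing information from the forward pass (lstm_forward)
Returns:
    gradients -- Python dictionary containing:
                      dx -- gradient of the inputs, of shape (n_x, m, T_x)
                      da0 -- gradient w.r.t. the initial hidden state, numpy array of shape (n_a, m)
                      dWf -- gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                      dWi -- gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                      dWc -- gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                      dWo -- gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                      dbf -- gradient w.r.t. the bias of the forget gate, of shape (n_a, 1)
                      dbi -- gradient w.r.t. the bias of the update gate, of shape (n_a, 1)
                      dbc -- gradient w.r.t. the bias of the memory gate, of shape (n_a, 1)
                      dbo -- gradient w.r.t. the bias of the output gate, of shape (n_a, 1)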

def lstm_backward(da, caches):
    # Retrieve values from the first cache (t=1) of caches.
    (caches, x) = caches
    (a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]
    
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # initialize the gradients with the right sizes (≈12 lines)
    dx = np.zeros((n_x, m, T_x))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))
    dc_prevt = np.zeros((n_a, m))
    dWf = np.zeros((n_a, n_a+n_x))
    dWi = np.zeros((n_a, n_a+n_x))
    dWc = np.zeros((n_a, n_a+n_x))
    dWo = np.zeros((n_a, n_a+n_x))
    dbf = np.zeros((n_a, 1))
    dbi = np.zeros((n_a, 1))
    dbc = np.zeros((n_a, 1))
    dbo = np.zeros((n_a, 1))

    # loop back over the whole sequence
    for t in reversed(range(T_x)):

        # Compute all gradients using lstm_cell_backward
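        gradients = lstm_cell_backward(da[:, :, t] + da_prevt, dc_prevt, caches[t])

        # Carry the hidden-state and cell-state gradients back to the previous time step
        da_prevt = gradients['da_prev']
        dc_prevt = gradients['dc_prev']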

        # Store or add the gradient to the parameters' previous step's gradient
        dx[:,:,t] = gradients['dxt']
        dWf = dWf + gradients['dWf']
        dWi = dWi + gradients['dWi']
        dWc = dWc + gradients['dWc']
        dWo = dWo + gradients['dWo']
        dbf = dbf + gradients['dbf']
        dbi = dbi + gradients['dbi']
        dbc = dbc + gradients['dbc']
        dbo = dbo + gradients['dbo']

    # Set the first activation's gradient to the backpropagated gradient da_prev.
    da0 = gradients['da_prev']

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}
    
    return gradients

np.random.seed(1)
x = np.random.randn(3,10,7)
a0 = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a, y, c, caches = lstm_forward(x, a0, parameters)

da = np.random.randn(5, 10, 7)    # the last dimension must match the 7 time steps of x
gradients = lstm_backward(da, caches)

print("gradients[\"dx\"][1][2] =", gradients["dx"][1][2])
print("gradients[\"dx\"].shape =", gradients["dx"].shape)
print("gradients[\"da0\"][2][3] =", gradients["da0"][2][3])
print("gradients[\"da0\"].shape =", gradients["da0"].shape)
print("gradients[\"dWf\"][3][1] =", gradients["dWf"][3][1])
print("gradients[\"dWf\"].shape =", gradients["dWf"].shape)
print("gradients[\"dWi\"][1][2] =", gradients["dWi"][1][2])
print("gradients[\"dWi\"].shape =", gradients["dWi"].shape)
print("gradients[\"dWc\"][3][1] =", gradients["dWc"][3][1])
print("gradients[\"dWc\"].shape =", gradients["dWc"].shape)
print("gradients[\"dWo\"][1][2] =", gradients["dWo"][1][2])
print("gradients[\"dWo\"].shape =", gradients["dWo"].shape)
print("gradients[\"dbf\"][4] =", gradients["dbf"][4])
print("gradients[\"dbf\"].shape =", gradients["dbf"].shape)
print("gradients[\"dbi\"][4] =", gradients["dbi"][4])
print("gradients[\"dbi\"].shape =", gradients["dbi"].shape)
print("gradients[\"dbc\"][4] =", gradients["dbc"][4])
print("gradients[\"dbc\"].shape =", gradients["dbc"].shape)
print("gradients[\"dbo\"][4] =", gradients["dbo"][4])
print("gradients[\"dbo\"].shape =", gradients["dbo"].shape)