

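Thanks to their distinctive "memory", recurrent neural networks (RNNs) are widely used in natural language processing (NLP) and other sequence tasks. An RNN reads the inputs $x^{\langle t \rangle}$ (such as words) one at a time, and the hidden-layer activations passed from one time step to the next let it retain some information or context. A unidirectional RNN, for example, uses information from the past to produce later outputs, whereas a bidirectional RNN can draw context from both the past and the future.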


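  • Square brackets $[l]$ denote an object associated with the $l$-th layer.
    • Example: $a^{[4]}$ is the activation of layer 4; $W^{[5]}$ and $b^{[5]}$ are the parameters of layer 5.
  • Parentheses $(i)$ denote an object associated with the $i$-th example.
    • Example: $x^{(i)}$ is the $i$-th training example (Zhou Zhihua's textbook translates "example" as 样例).
  • Angle brackets $\langle t \rangle$ denote an object at the $t$-th time step.
    • Example: $x^{\langle t \rangle}$ is the input at time step $t$; $x^{(i)\langle t \rangle}$ is the input of example $i$ at time step $t$.
  • A subscript $i$ denotes the $i$-th entry of a vector.
    • Example: $a^{[l]}_i$ is the $i$-th entry of the activation vector of layer $l$.

First, import the required packages: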

      import numpy as np
      from rnn_utils import *
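The module rnn_utils is not reproduced in this post. The forward-pass code only relies on two helpers from it, softmax and sigmoid. The block below is a minimal sketch of what they might look like, assuming a column-wise softmax (one column per example); the actual course file may differ.

```python
import numpy as np

def softmax(x):
    """Column-wise softmax. Subtracting the column max keeps exp() numerically stable."""
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0, keepdims=True)

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

# Quick sanity check on a 2x2 input: every column of the softmax sums to 1
z = np.array([[1.0, 2.0], [3.0, 2.0]])
column_sums = softmax(z).sum(axis=0)
print(column_sums)
```

With helpers like these in scope, the functions in the rest of the post can be run without the original rnn_utils file.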

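      1 - Forward propagation for the basic RNN

      The basic RNN structure is shown below. In this example, $T_x = T_y$.

      Figure 1. Basic RNN model


      1. Implement the computation of the RNN for a single time step.
      2. Implement a loop over the $T_x$ time steps to process all the inputs, one at a time.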

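      1.1 - RNN cell


      Figure 2. Basic RNN cell. It takes as input $x^{\langle t \rangle}$ (the current input) and $a^{\langle t-1 \rangle}$ (the previous hidden state, which carries information from the past), and outputs $a^{\langle t \rangle}$, which is passed to the next RNN cell and is also used to predict $y^{\langle t \rangle}$.


      1. Compute the hidden state with a tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$
      2. Use the new hidden state $a^{\langle t \rangle}$ to compute the prediction $\hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$
      3. Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in the tuple cache for later use
      4. Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$ and cache

      The computation is vectorized over $m$ examples, so $x^{\langle t \rangle}$ has shape $(n_x, m)$ and $a^{\langle t \rangle}$ has shape $(n_a, m)$.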

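      # Implementation: rnn_cell_forward
      def rnn_cell_forward(xt, a_prev, parameters):
          """
          Forward step of a single RNN cell.

          Arguments:
          xt -- input data at time step t, numpy array of shape (n_x, m)
          a_prev -- hidden state at time step t-1, numpy array of shape (n_a, m)
          parameters -- python dictionary containing:
                              Wax -- weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                              Waa -- weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                              Wya -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
                              ba -- bias, numpy array of shape (n_a, 1)
                              by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)

          Returns:
          a_next -- next hidden state, of shape (n_a, m)
          yt_pred -- prediction at time step "t", numpy array of shape (n_y, m)
          cache -- tuple of values needed for the backward pass, containing (a_next, a_prev, xt, parameters)
          """
          # Retrieve the parameters from "parameters"
          Wax = parameters["Wax"]
          Waa = parameters["Waa"]
          Wya = parameters["Wya"]
          ba = parameters["ba"]
          by = parameters["by"]
          # Compute the next activation state using the formula given above
          a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
          # Compute the output of the current cell using the formula given above
          yt_pred = softmax(np.dot(Wya, a_next) + by)
          # Store the values needed for backward propagation
          cache = (a_next, a_prev, xt, parameters)
          return a_next, yt_pred, cache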
      xt = np.random.randn(3,10)
      a_prev = np.random.randn(5,10)
      Waa = np.random.randn(5,5)
      Wax = np.random.randn(5,3)
      Wya = np.random.randn(2,5)
      ba = np.random.randn(5,1)
      by = np.random.randn(2,1)
      parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}
      a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
      print("a_next[4] = ", a_next[4])
      print("a_next.shape = ", a_next.shape)
      print("yt_pred[1] =", yt_pred[1])
      print("yt_pred.shape = ", yt_pred.shape)
      a_next[4] =  [ 0.59584544  0.18141802  0.61311866  0.99808218  0.85016201  0.99980978
       -0.18887155  0.99815551  0.6531151   0.82872037]
      a_next.shape =  (5, 10)
      yt_pred[1] = [ 0.9888161   0.01682021  0.21140899  0.36817467  0.98988387  0.88945212
        0.36920224  0.9966312   0.9982559   0.17746526]
      yt_pred.shape =  (2, 10)

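      1.2 - RNN forward pass

      An RNN can be viewed as the cell implemented above repeated over time. If the input sequence has 10 time steps, the RNN cell is applied ten times. Each cell takes the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the input for the current time step ($x^{\langle t \rangle}$), and outputs a hidden state ($a^{\langle t \rangle}$) and a prediction ($y^{\langle t \rangle}$) for that time step.

      Figure 3. Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ spans $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$.


      1. Create an all-zero array $a$ to store every hidden state computed by the RNN
      2. Initialize the "next" hidden state to $a_0$ (the initial hidden state)
      3. Loop over the time steps, with $t$ as the increasing index
      - Run rnn_cell_forward to update the "next" hidden state and the cache
      - Store the "next" hidden state in $a$ (at position $t$)
      - Store the prediction in y
      - Append the cache to the list caches
      4. Return $a$, $y$ and caches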

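      # Implementation: rnn_forward
      def rnn_forward(x, a0, parameters):
          """
          Forward propagation of the basic RNN over all time steps.

          Arguments:
          x -- input data for every time step, numpy array of shape (n_x, m, T_x)
          a0 -- initial hidden state, numpy array of shape (n_a, m)
          parameters -- python dictionary containing:
                              Waa -- weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                              Wax -- weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                              Wya -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
                              ba -- bias, numpy array of shape (n_a, 1)
                              by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)

          Returns:
          a -- hidden states for every time step, numpy array of shape (n_a, m, T_x)
          y_pred -- predictions for every time step, numpy array of shape (n_y, m, T_x)
          caches -- values needed for the backward pass, tuple of (list of caches, x)
          """
          # Initialize "caches", the list that will hold every cache
          caches = []
          # Retrieve dimensions from the shapes of x and parameters["Wya"]
          n_x, m, T_x = x.shape
          n_y, n_a = parameters["Wya"].shape
          # Initialize "a" and "y_pred" with zeros
          a = np.zeros((n_a, m, T_x))
          y_pred = np.zeros((n_y, m, T_x))
          # Initialize a_next
          a_next = a0
          # Loop over all time steps
          for t in range(T_x):
              # Update the next hidden state, compute the prediction, get the cache
              a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)
              # Save the new "next" hidden state in a
              a[:, :, t] = a_next
              # Save the prediction in y_pred
              y_pred[:, :, t] = yt_pred
              # Append the cache to caches
              caches.append(cache)
          # Store the values needed for backward propagation
          caches = (caches, x)
          return a, y_pred, caches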
      x = np.random.randn(3, 10, 4)
      a0 = np.random.randn(5, 10)
      Waa = np.random.randn(5, 5)
      Wax = np.random.randn(5, 3)
      Wya = np.random.randn(2, 5)
      ba = np.random.randn(5, 1)
      by = np.random.randn(2, 1)
      parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by":by}
      a, y_pred, caches = rnn_forward(x, a0, parameters)
      print("a[4][1] = ", a[4][1])
      print("a.shape = ", a.shape)
      print("y_pred[1][3] = ", y_pred[1][3])
      print("y_pred.shape = ", y_pred.shape)
      print("caches[1][1][3] = ", caches[1][1][3])
      print("len(caches) =", len(caches))
      a[4][1] =  [-0.99999375  0.77911235 -0.99861469 -0.99833267]
      a.shape =  (5, 10, 4)
      y_pred[1][3] =  [ 0.79560373  0.86224861  0.11118257  0.81515947]
      y_pred.shape =  (2, 10, 4)
      caches[1][1][3] =  [-1.1425182  -0.34934272 -0.20889423  0.58662319]
      len(caches) = 2

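      So far, the forward propagation of a recurrent neural network has been implemented from scratch, which is already sufficient for some applications. However, this basic RNN suffers from vanishing gradients, so it works best when each output $y^{\langle t \rangle}$ depends mainly on "local" context, that is, on inputs $x^{\langle t' \rangle}$ where $t'$ is not far from $t$.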
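The vanishing-gradient remark can be illustrated numerically. The sketch below is an illustration only, not part of the original assignment: it tracks the Jacobian of the hidden state at step $t$ with respect to the initial hidden state, which is a product of one-step factors $\mathrm{diag}(1 - (a^{\langle t \rangle})^2)\,W_{aa}$. The sizes, the seed and the small weight scale (0.1) are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, T = 5, 3, 20
Waa = 0.1 * rng.standard_normal((n_a, n_a))  # small recurrent weights (assumption)
Wax = rng.standard_normal((n_a, n_x))

a = np.zeros(n_a)
J = np.eye(n_a)   # running Jacobian of a<t> with respect to a<0>
norms = []
for t in range(T):
    x = rng.standard_normal(n_x)
    a = np.tanh(Waa @ a + Wax @ x)
    # One-step Jacobian of a<t> w.r.t. a<t-1> is diag(1 - a<t>^2) @ Waa
    J = (np.diag(1.0 - a ** 2) @ Waa) @ J
    norms.append(np.linalg.norm(J))

print("norm after 1 step: ", norms[0])
print("norm after", T, "steps:", norms[-1])
```

The norm shrinks as the number of steps grows, so an input far in the past has almost no influence on the gradient at the current step. This is the limitation the LSTM in the next section is designed to ease.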


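      2 - Long Short-Term Memory (LSTM) network


      Figure 4. LSTM cell. At every time step the cell state ("memory") variable $c^{\langle t \rangle}$ is passed along and updated; it is distinct from the hidden state $a^{\langle t \rangle}$.

      As with the basic RNN above, first implement the LSTM cell for a single time step, then call it iteratively in a for loop over the $T_x$ time steps.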


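      - Forget gate


      $$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f) \tag{1}$$

      Here $W_f$ holds the weights that govern the forget gate. The concatenation $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ is multiplied by $W_f$, and the resulting vector $\Gamma_f^{\langle t \rangle}$ has entries between 0 and 1. The forget gate vector is multiplied element-wise with the previous cell state $c^{\langle t-1 \rangle}$. If an entry of $\Gamma_f^{\langle t \rangle}$ is 0 or close to 0, the LSTM removes the corresponding piece of information from $c^{\langle t-1 \rangle}$ (for example, that the subject is singular). If the entry is 1, the information is kept.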
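As a small illustration of the forget gate, the sketch below builds a gate whose entries are close to 0, exactly 0.5, and close to 1, and applies it to a cell state. All values are made up for the illustration; the weights are set to zero so that the bias alone determines the gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for the stacked [a_prev; xt] of a single example (hypothetical values)
concat = np.array([[0.3], [-1.2], [0.7]])
Wf = np.zeros((3, 3))                      # zero weights: the bias alone sets the gate
bf = np.array([[-20.0], [0.0], [20.0]])

gamma_f = sigmoid(Wf @ concat + bf)        # approximately [0, 0.5, 1]
c_prev = np.array([[2.0], [-4.0], [3.0]])
kept = gamma_f * c_prev                    # element-wise product with the old cell state

print(gamma_f.ravel())
print(kept.ravel())
```

The first entry of the cell state is erased, the second is halved, and the third passes through almost unchanged.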

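      - Update gate


      $$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u) \tag{2}$$

      As with the forget gate, $\Gamma_u^{\langle t \rangle}$ is a vector of values between 0 and 1. It is multiplied element-wise with $\tilde{c}^{\langle t \rangle}$ to compute $c^{\langle t \rangle}$. (In the code, $W_u$ and $b_u$ are named Wi and bi.)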

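      - Updating the LSTM cell


      $$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c) \tag{3}$$


      $$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} \tag{4}$$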
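Equation (4) is what lets the LSTM carry information over many steps. The sketch below is an illustration with made-up, constant gate values (in a real LSTM the gates change at every step): with $\Gamma_f = 1$ and $\Gamma_u = 0$ the cell state survives unchanged, while $\Gamma_f = 0.9$ makes it decay geometrically.

```python
import numpy as np

rng = np.random.default_rng(1)
c0 = rng.standard_normal(4)   # hypothetical initial cell state

def run(gamma_f, gamma_u, steps=50):
    """Apply equation (4) repeatedly with constant gate values."""
    c = c0.copy()
    for _ in range(steps):
        c_tilde = np.tanh(rng.standard_normal(4))   # arbitrary candidate values
        c = gamma_f * c + gamma_u * c_tilde
    return c

c_keep = run(gamma_f=1.0, gamma_u=0.0)   # the gates fully "remember"
c_leak = run(gamma_f=0.9, gamma_u=0.0)   # the gates slowly forget

print(c0)
print(c_keep)
print(c_leak)
```

After 50 steps c_keep still equals c0, whereas c_leak has shrunk by a factor of 0.9 to the power 50.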

      - Output gate


      $$\Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o) \tag{5}$$

      $$a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle}) \tag{6}$$

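      2.1 - LSTM cell


      1. Stack $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ into a single matrix: $concat = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$
      2. Compute equations (1) to (6), using sigmoid() and np.tanh()
      3. Compute the prediction $y^{\langle t \rangle}$, using softmax()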
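The stacking in step 1 can be written by filling a zero matrix, as lstm_cell_forward does, or with np.concatenate. The sketch below uses made-up shapes and a hypothetical weight matrix W to check that both give the same result, and that multiplying by the stacked matrix equals splitting the weight columns into a hidden-state part and an input part.

```python
import numpy as np

n_a, n_x, m = 5, 3, 10
rng = np.random.default_rng(0)
a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))

# Manual fill: first n_a rows hold a_prev, the remaining n_x rows hold xt
concat_manual = np.zeros((n_a + n_x, m))
concat_manual[:n_a, :] = a_prev
concat_manual[n_a:, :] = xt

# Equivalent one-liner
concat = np.concatenate((a_prev, xt), axis=0)

# A gate weight matrix of shape (n_a, n_a + n_x) acts on both parts at once
W = rng.standard_normal((n_a, n_a + n_x))
joint = W @ concat
split = W[:, :n_a] @ a_prev + W[:, n_a:] @ xt
print(np.allclose(joint, split))
```

The column split is also what the backward pass relies on: the first n_a columns of each weight matrix carry the gradient back to the hidden state, and the remaining columns carry it to the input.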

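      # Implementation: lstm_cell_forward
      def lstm_cell_forward(xt, a_prev, c_prev, parameters):
          """
          Forward step of a single LSTM cell.

          Arguments:
          xt -- input data at time step "t", numpy array of shape (n_x, m)
          a_prev -- hidden state at time step "t-1", numpy array of shape (n_a, m)
          c_prev -- memory state at time step "t-1", numpy array of shape (n_a, m)
          parameters -- python dictionary containing:
                              Wf -- weight matrix of the forget gate, numpy array of shape (n_a, n_a+n_x)
                              bf -- bias of the forget gate, numpy array of shape (n_a, 1)
                              Wi -- weight matrix of the update gate, numpy array of shape (n_a, n_a+n_x)
                              bi -- bias of the update gate, numpy array of shape (n_a, 1)
                              Wc -- weight matrix of the first "tanh", numpy array of shape (n_a, n_a+n_x)
                              bc -- bias of the first "tanh", numpy array of shape (n_a, 1)
                              Wo -- weight matrix of the output gate, numpy array of shape (n_a, n_a+n_x)
                              bo -- bias of the output gate, numpy array of shape (n_a, 1)
                              Wy -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
                              by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)

          Returns:
          a_next -- next hidden state, of shape (n_a, m)
          c_next -- next memory state, of shape (n_a, m)
          yt_pred -- prediction at time step "t", numpy array of shape (n_y, m)
          cache -- tuple of values needed for the backward pass, containing
                   (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

          Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde).
          """
          # Retrieve the parameters from "parameters"
          Wf = parameters["Wf"]
          bf = parameters["bf"]
          Wi = parameters["Wi"]
          bi = parameters["bi"]
          Wc = parameters["Wc"]
          bc = parameters["bc"]
          Wo = parameters["Wo"]
          bo = parameters["bo"]
          Wy = parameters["Wy"]
          by = parameters["by"]
          # Retrieve dimensions from the shapes of xt and Wy
          n_x, m = xt.shape
          n_y, n_a = Wy.shape
          # Concatenate a_prev and xt
          concat = np.zeros((n_a + n_x, m))
          concat[: n_a, :] = a_prev
          concat[n_a :, :] = xt
          # Compute ft, it, cct, c_next, ot, a_next using equations (1) to (6)
          ft = sigmoid(np.dot(Wf, concat) + bf)
          it = sigmoid(np.dot(Wi, concat) + bi)
          cct = np.tanh(np.dot(Wc, concat) + bc)
          c_next = ft * c_prev + it * cct
          ot = sigmoid(np.dot(Wo, concat) + bo)
          a_next = ot * np.tanh(c_next)
          # Compute the prediction of the LSTM cell
          yt_pred = softmax(np.dot(Wy, a_next) + by)
          # Store the values needed for backward propagation
          cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)
          return a_next, c_next, yt_pred, cache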
      xt = np.random.randn(3,10)
      a_prev = np.random.randn(5,10)
      c_prev = np.random.randn(5,10)
      Wf = np.random.randn(5, 5+3)
      bf = np.random.randn(5,1)
      Wi = np.random.randn(5, 5+3)
      bi = np.random.randn(5,1)
      Wo = np.random.randn(5, 5+3)
      bo = np.random.randn(5,1)
      Wc = np.random.randn(5, 5+3)
      bc = np.random.randn(5,1)
      Wy = np.random.randn(2,5)
      by = np.random.randn(2,1)
      parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}
      a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)
      print("a_next[4] = ", a_next[4])
      print("a_next.shape = ", a_next.shape)
      print("c_next[2] = ", c_next[2])
      print("c_next.shape = ", c_next.shape)
      print("yt[1] =", yt[1])
      print("yt.shape = ", yt.shape)
      print("cache[1][3] =", cache[1][3])
      print("len(cache) = ", len(cache))
      a_next[4] =  [-0.66408471  0.0036921   0.02088357  0.22834167 -0.85575339  0.00138482
        0.76566531  0.34631421 -0.00215674  0.43827275]
      a_next.shape =  (5, 10)
      c_next[2] =  [ 0.63267805  1.00570849  0.35504474  0.20690913 -1.64566718  0.11832942
        0.76449811 -0.0981561  -0.74348425 -0.26810932]
      c_next.shape =  (5, 10)
      yt[1] = [ 0.79913913  0.15986619  0.22412122  0.15606108  0.97057211  0.31146381
        0.00943007  0.12666353  0.39380172  0.07828381]
      yt.shape =  (2, 10)
      cache[1][3] = [-0.16263996  1.03729328  0.72938082 -0.54101719  0.02752074 -0.30821874
        0.07651101 -1.03752894  1.41219977 -0.37647422]
      len(cache) =  10

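      2.2 - LSTM forward pass


      Figure 5. LSTM over multiple time steps

      Exercise: implement lstm_forward() to run the LSTM over $T_x$ time steps.

      Note: $c^{\langle 0 \rangle}$ is initialized to an all-zero matrix.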

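      # Implementation: lstm_forward
      def lstm_forward(x, a0, parameters):
          """
          Forward propagation of the LSTM over all time steps.

          Arguments:
          x -- input data for every time step, numpy array of shape (n_x, m, T_x)
          a0 -- initial hidden state, numpy array of shape (n_a, m)
          parameters -- python dictionary containing:
                              Wf -- weight matrix of the forget gate, numpy array of shape (n_a, n_a+n_x)
                              bf -- bias of the forget gate, numpy array of shape (n_a, 1)
                              Wi -- weight matrix of the update gate, numpy array of shape (n_a, n_a+n_x)
                              bi -- bias of the update gate, numpy array of shape (n_a, 1)
                              Wc -- weight matrix of the first "tanh", numpy array of shape (n_a, n_a+n_x)
                              bc -- bias of the first "tanh", numpy array of shape (n_a, 1)
                              Wo -- weight matrix of the output gate, numpy array of shape (n_a, n_a+n_x)
                              bo -- bias of the output gate, numpy array of shape (n_a, 1)
                              Wy -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
                              by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)

          Returns:
          a -- hidden states for every time step, numpy array of shape (n_a, m, T_x)
          y -- predictions for every time step, numpy array of shape (n_y, m, T_x)
          c -- memory states for every time step, numpy array of shape (n_a, m, T_x)
               (this entry is not in the original assignment; added by the author)
          caches -- values needed for the backward pass, tuple of (list of caches, x)
          """
          # Initialize "caches"
          caches = []
          # Retrieve dimensions from the shapes of x and parameters["Wy"]
          n_x, m, T_x = x.shape
          n_y, n_a = parameters["Wy"].shape
          # Initialize "a", "c" and "y" with zeros
          a = np.zeros((n_a, m, T_x))
          c = np.zeros((n_a, m, T_x))
          y = np.zeros((n_y, m, T_x))
          # Initialize a_next and c_next
          a_next = a0
          c_next = np.zeros((n_a, m))
          # Loop over all time steps
          for t in range(T_x):
              # Update the hidden state and the memory state, compute the prediction, get the cache
              a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)
              # Save the new "next" hidden state in a
              a[:, :, t] = a_next
              # Save the prediction in y
              y[:, :, t] = yt
              # Save the next cell state
              c[:, :, t] = c_next
              # Append the cache to caches
              caches.append(cache)
          # Store the values needed for backward propagation
          caches = (caches, x)
          return a, y, c, caches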
      x = np.random.randn(3,10,7)
      a0 = np.random.randn(5,10)
      Wf = np.random.randn(5, 5+3)
      bf = np.random.randn(5,1)
      Wi = np.random.randn(5, 5+3)
      bi = np.random.randn(5,1)
      Wo = np.random.randn(5, 5+3)
      bo = np.random.randn(5,1)
      Wc = np.random.randn(5, 5+3)
      bc = np.random.randn(5,1)
      Wy = np.random.randn(2,5)
      by = np.random.randn(2,1)
      parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}
      a, y, c, caches = lstm_forward(x, a0, parameters)
      print("a[4][3][6] = ", a[4][3][6])
      print("a.shape = ", a.shape)
      print("y[1][4][3] =", y[1][4][3])
      print("y.shape = ", y.shape)
      print("caches[1][1][1] =", caches[1][1][1])
      print("c[1][2][1]", c[1][2][1])
      print("len(caches) = ", len(caches))
      a[4][3][6] =  0.172117767533
      a.shape =  (5, 10, 7)
      y[1][4][3] = 0.95087346185
      y.shape =  (2, 10, 7)
      caches[1][1][1] = [ 0.82797464  0.23009474  0.76201118 -0.22232814 -0.20075807  0.18656139
      c[1][2][1] -0.855544916718
      len(caches) =  2


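      3 - Backpropagation in recurrent neural networks



      3.1 - Basic RNN backward pass


      Figure 6. Backward pass of the RNN cell. As in a fully connected network, the derivative of the cost function $J$ is propagated through the RNN by the chain rule. The chain rule is also used to compute $(\frac{\partial J}{\partial W_{ax}}, \frac{\partial J}{\partial W_{aa}}, \frac{\partial J}{\partial b})$, which are then used to update $(W_{ax}, W_{aa}, b_a)$.



      The derivative of $\tanh(x)$ is $1 - \tanh(x)^2$ (the original post links to a complete proof). Note that $\mathrm{sech}(x)^2 = 1 - \tanh(x)^2$.

      Likewise, for $\frac{\partial a^{\langle t \rangle}}{\partial W_{ax}}, \frac{\partial a^{\langle t \rangle}}{\partial W_{aa}}, \frac{\partial a^{\langle t \rangle}}{\partial b}$, the derivative of $\tanh(u)$ is $(1 - \tanh(u)^2)\,du$.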
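The tanh derivative can be verified numerically. The short check below compares the closed form $1 - \tanh(u)^2$ with a central finite difference; the grid of points and the step size are arbitrary choices for the illustration.

```python
import numpy as np

u = np.linspace(-3.0, 3.0, 13)   # sample points (arbitrary)
eps = 1e-6                        # finite-difference step (arbitrary)

numeric = (np.tanh(u + eps) - np.tanh(u - eps)) / (2 * eps)
analytic = 1.0 - np.tanh(u) ** 2

max_err = np.max(np.abs(numeric - analytic))
print("largest absolute difference:", max_err)
```

Because $a^{\langle t \rangle}$ is itself the tanh output, the backward pass can reuse it and compute this factor as 1 - a_next**2 without evaluating tanh again.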

      def rnn_cell_backward(da_next, cache):
          """
          Implements the backward pass for the RNN-cell (single time-step).

          Arguments:
          da_next -- Gradient of loss with respect to next hidden state
          cache -- python tuple containing useful values (output of rnn_cell_forward())

          Returns:
          gradients -- python dictionary containing:
                              dxt -- Gradients of input data, of shape (n_x, m)
                              da_prev -- Gradients of previous hidden state, of shape (n_a, m)
                              dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                              dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                              dba -- Gradients of bias vector, of shape (n_a, 1)
          """
          # Retrieve values from cache
          (a_next, a_prev, xt, parameters) = cache
          # Retrieve values from parameters
          Wax = parameters["Wax"]
          Waa = parameters["Waa"]
          Wya = parameters["Wya"]
          ba = parameters["ba"]
          by = parameters["by"]
          ### START CODE HERE ###
          # compute the gradient of tanh with respect to a_next (≈1 line)
          dtanh = None
          # compute the gradient of the loss with respect to Wax (≈2 lines)
          dxt = None
          dWax = None
          # compute the gradient with respect to Waa (≈2 lines)
          da_prev = None
          dWaa = None
          # compute the gradient with respect to b (≈1 line)
          dba = None
          ### END CODE HERE ###
          # Store the gradients in a python dictionary
          gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
          return gradients
      xt = np.random.randn(3,10)
      a_prev = np.random.randn(5,10)
      Wax = np.random.randn(5,3)
      Waa = np.random.randn(5,5)
      Wya = np.random.randn(2,5)
      ba = np.random.randn(5,1)
      by = np.random.randn(2,1)
      parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}
      a_next, yt, cache = rnn_cell_forward(xt, a_prev, parameters)
      da_next = np.random.randn(5,10)
      gradients = rnn_cell_backward(da_next, cache)
      print("gradients[\"dxt\"][1][2] =", gradients["dxt"][1][2])
      print("gradients[\"dxt\"].shape =", gradients["dxt"].shape)
      print("gradients[\"da_prev\"][2][3] =", gradients["da_prev"][2][3])
      print("gradients[\"da_prev\"].shape =", gradients["da_prev"].shape)
      print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
      print("gradients[\"dWax\"].shape =", gradients["dWax"].shape)
      print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
      print("gradients[\"dWaa\"].shape =", gradients["dWaa"].shape)
      print("gradients[\"dba\"][4] =", gradients["dba"][4])
      print("gradients[\"dba\"].shape =", gradients["dba"].shape)
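The exercise template above leaves the gradient lines blank. The block below is one possible way to fill them in, written as a separate function named rnn_cell_backward_sketch so that it is not mistaken for the course's reference solution. It assumes the cache layout (a_next, a_prev, xt, parameters) produced by rnn_cell_forward, and checks one entry of dWaa against a central finite difference on the made-up scalar loss sum(da_next * a_next).

```python
import numpy as np

def rnn_cell_backward_sketch(da_next, cache):
    """One-step RNN backward pass; cache = (a_next, a_prev, xt, parameters)."""
    a_next, a_prev, xt, parameters = cache
    Wax, Waa = parameters["Wax"], parameters["Waa"]
    # a_next = tanh(z), so the gradient w.r.t. z is (1 - a_next^2) * da_next
    dz = (1.0 - a_next ** 2) * da_next
    return {
        "dxt": Wax.T @ dz,
        "da_prev": Waa.T @ dz,
        "dWax": dz @ xt.T,
        "dWaa": dz @ a_prev.T,
        "dba": np.sum(dz, axis=1, keepdims=True),
    }

# Made-up data for the check
rng = np.random.default_rng(0)
n_x, n_a, m = 3, 5, 4
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
params = {"Wax": rng.standard_normal((n_a, n_x)),
          "Waa": rng.standard_normal((n_a, n_a)),
          "ba": rng.standard_normal((n_a, 1))}
da_next = rng.standard_normal((n_a, m))

def forward(p):
    return np.tanh(p["Waa"] @ a_prev + p["Wax"] @ xt + p["ba"])

a_next = forward(params)
grads = rnn_cell_backward_sketch(da_next, (a_next, a_prev, xt, params))

# Central finite difference on Waa[1, 2] for the loss sum(da_next * a_next)
eps = 1e-6
p_plus = {k: v.copy() for k, v in params.items()}
p_minus = {k: v.copy() for k, v in params.items()}
p_plus["Waa"][1, 2] += eps
p_minus["Waa"][1, 2] -= eps
numeric = np.sum(da_next * (forward(p_plus) - forward(p_minus))) / (2 * eps)
print(numeric, grads["dWaa"][1, 2])
```

The two printed numbers should agree to several decimal places, which is a quick way to catch a transposed matrix or a missing tanh factor.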
      Backward pass through the RNN

      Computing the gradients of the cost with respect to $a^{\langle t \rangle}$ at every time-step $t$ is useful because it is what helps the gradient backpropagate to the previous RNN-cell. To do so, you need to iterate through all the time steps starting at the end, and at each step, you increment the overall $db_a$, $dW_{aa}$, $dW_{ax}$ and you store $dx$.


      Implement the rnn_backward function. Initialize the return variables with zeros first and then loop through all the time steps while calling rnn_cell_backward at each time step, updating the other variables accordingly.

      def rnn_backward(da, caches):
          """
          Implement the backward pass for a RNN over an entire sequence of input data.

          Arguments:
          da -- Upstream gradients of all hidden states, of shape (n_a, m, T_x)
          caches -- tuple containing information from the forward pass (rnn_forward)

          Returns:
          gradients -- python dictionary containing:
                              dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
                              da0 -- Gradient w.r.t. the initial hidden state, numpy-array of shape (n_a, m)
                              dWax -- Gradient w.r.t. the input's weight matrix, numpy-array of shape (n_a, n_x)
                              dWaa -- Gradient w.r.t. the hidden state's weight matrix, numpy-array of shape (n_a, n_a)
                              dba -- Gradient w.r.t. the bias, of shape (n_a, 1)
          """
          ### START CODE HERE ###
          # Retrieve values from the first cache (t=1) of caches (≈2 lines)
          (caches, x) = None
          (a1, a0, x1, parameters) = None
          # Retrieve dimensions from da's and x1's shapes (≈2 lines)
          n_a, m, T_x = None
          n_x, m = None
          # initialize the gradients with the right sizes (≈6 lines)
          dx = None
          dWax = None
          dWaa = None
          dba = None
          da0 = None
          da_prevt = None
          # Loop through all the time steps
          for t in reversed(range(None)):
              # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
              gradients = None
              # Retrieve derivatives from gradients (≈ 1 line)
              dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
              # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
              dx[:, :, t] = None
              dWax += None
              dWaa += None
              dba += None
          # Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line) 
          da0 = None
          ### END CODE HERE ###
          # Store the gradients in a python dictionary
          gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}
          return gradients
      x = np.random.randn(3,10,4)
      a0 = np.random.randn(5,10)
      Wax = np.random.randn(5,3)
      Waa = np.random.randn(5,5)
      Wya = np.random.randn(2,5)
      ba = np.random.randn(5,1)
      by = np.random.randn(2,1)
      parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}
      a, y, caches = rnn_forward(x, a0, parameters)
      da = np.random.randn(5, 10, 4)
      gradients = rnn_backward(da, caches)
      print("gradients[\"dx\"][1][2] =", gradients["dx"][1][2])
      print("gradients[\"dx\"].shape =", gradients["dx"].shape)
      print("gradients[\"da0\"][2][3] =", gradients["da0"][2][3])
      print("gradients[\"da0\"].shape =", gradients["da0"].shape)
      print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
      print("gradients[\"dWax\"].shape =", gradients["dWax"].shape)
      print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
      print("gradients[\"dWaa\"].shape =", gradients["dWaa"].shape)
      print("gradients[\"dba\"][4] =", gradients["dba"][4])
      print("gradients[\"dba\"].shape =", gradients["dba"].shape)
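The loop described above can be sketched as follows. This is one possible implementation under stated assumptions, not the course's reference solution: the helper names (cell_forward, cell_backward, rnn_forward_sketch, rnn_backward_sketch) are made up, the output layer is left out, and the weights are scaled by 0.5 to keep the check well behaved. At each step the gradient arriving from later time steps is added to da[:, :, t] before calling the one-step backward function.

```python
import numpy as np

def cell_forward(xt, a_prev, p):
    a_next = np.tanh(p["Waa"] @ a_prev + p["Wax"] @ xt + p["ba"])
    return a_next, (a_next, a_prev, xt, p)

def cell_backward(da_next, cache):
    a_next, a_prev, xt, p = cache
    dz = (1.0 - a_next ** 2) * da_next
    return {"dxt": p["Wax"].T @ dz, "da_prev": p["Waa"].T @ dz,
            "dWax": dz @ xt.T, "dWaa": dz @ a_prev.T,
            "dba": dz.sum(axis=1, keepdims=True)}

def rnn_forward_sketch(x, a0, p):
    n_x, m, T_x = x.shape
    a = np.zeros((a0.shape[0], m, T_x))
    cache_list, a_next = [], a0
    for t in range(T_x):
        a_next, cache = cell_forward(x[:, :, t], a_next, p)
        a[:, :, t] = a_next
        cache_list.append(cache)
    return a, (cache_list, x)

def rnn_backward_sketch(da, caches):
    cache_list, x = caches
    n_a, m, T_x = da.shape
    n_x = x.shape[0]
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da_prevt = np.zeros((n_a, m))
    for t in reversed(range(T_x)):
        # Gradient on a<t>: direct upstream part plus the part flowing back from step t+1
        g = cell_backward(da[:, :, t] + da_prevt, cache_list[t])
        da_prevt = g["da_prev"]
        dx[:, :, t] = g["dxt"]
        dWax += g["dWax"]
        dWaa += g["dWaa"]
        dba += g["dba"]
    return {"dx": dx, "da0": da_prevt, "dWax": dWax, "dWaa": dWaa, "dba": dba}

# Gradient check on made-up data, using the loss sum(da * a)
rng = np.random.default_rng(0)
n_x, n_a, m, T_x = 3, 5, 4, 6
x = rng.standard_normal((n_x, m, T_x))
a0 = rng.standard_normal((n_a, m))
p = {"Wax": 0.5 * rng.standard_normal((n_a, n_x)),
     "Waa": 0.5 * rng.standard_normal((n_a, n_a)),
     "ba": 0.5 * rng.standard_normal((n_a, 1))}
da = rng.standard_normal((n_a, m, T_x))
a, caches = rnn_forward_sketch(x, a0, p)
grads = rnn_backward_sketch(da, caches)

def loss(params):
    return np.sum(da * rnn_forward_sketch(x, a0, params)[0])

eps = 1e-6
p_plus = {k: v.copy() for k, v in p.items()}
p_minus = {k: v.copy() for k, v in p.items()}
p_plus["Waa"][0, 1] += eps
p_minus["Waa"][0, 1] -= eps
numeric = (loss(p_plus) - loss(p_minus)) / (2 * eps)
print(numeric, grads["dWaa"][0, 1])
```

Parameter gradients are summed over time because the same weights are reused at every step, while dx is only stored per step since each input is used once.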

      3.2 - LSTM backward pass

      3.2.1 One Step backward

      The LSTM backward pass is slightly more complicated than the forward one. We have provided you with all the equations for the LSTM backward pass below. (If you enjoy calculus exercises, feel free to try deriving these from scratch yourself.)

      3.2.2 gate derivatives

      Here $d\Gamma$ and $d\tilde{c}$ denote gradients with respect to the pre-activation input of each gate, which is why every expression ends with the derivative of the sigmoid or of the tanh.

      $$d\Gamma_o^{\langle t \rangle} = da_{next} * \tanh(c_{next}) * \Gamma_o^{\langle t \rangle} * (1 - \Gamma_o^{\langle t \rangle}) \tag{7}$$

      $$d\tilde{c}^{\langle t \rangle} = \left(dc_{next} * \Gamma_u^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} (1 - \tanh(c_{next})^2) * \Gamma_u^{\langle t \rangle} * da_{next}\right) * (1 - (\tilde{c}^{\langle t \rangle})^2) \tag{8}$$

      $$d\Gamma_u^{\langle t \rangle} = \left(dc_{next} * \tilde{c}^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} (1 - \tanh(c_{next})^2) * \tilde{c}^{\langle t \rangle} * da_{next}\right) * \Gamma_u^{\langle t \rangle} * (1 - \Gamma_u^{\langle t \rangle}) \tag{9}$$

      $$d\Gamma_f^{\langle t \rangle} = \left(dc_{next} * c_{prev} + \Gamma_o^{\langle t \rangle} (1 - \tanh(c_{next})^2) * c_{prev} * da_{next}\right) * \Gamma_f^{\langle t \rangle} * (1 - \Gamma_f^{\langle t \rangle}) \tag{10}$$

      3.2.3 parameter derivatives

      $$dW_f = d\Gamma_f^{\langle t \rangle} \begin{pmatrix} a_{prev} \\ x_t \end{pmatrix}^T \tag{11}$$

      $$dW_u = d\Gamma_u^{\langle t \rangle} \begin{pmatrix} a_{prev} \\ x_t \end{pmatrix}^T \tag{12}$$

      $$dW_c = d\tilde{c}^{\langle t \rangle} \begin{pmatrix} a_{prev} \\ x_t \end{pmatrix}^T \tag{13}$$

      $$dW_o = d\Gamma_o^{\langle t \rangle} \begin{pmatrix} a_{prev} \\ x_t \end{pmatrix}^T \tag{14}$$

      To calculate $db_f, db_u, db_c, db_o$ you just need to sum across the horizontal (axis=1) axis on $d\Gamma_f^{\langle t \rangle}, d\Gamma_u^{\langle t \rangle}, d\tilde{c}^{\langle t \rangle}, d\Gamma_o^{\langle t \rangle}$ respectively. Note that you should use the keepdims=True option.
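The role of keepdims=True can be seen on a toy array. In the sketch below d_gate is a made-up stand-in for one of the gate gradients, with shape (n_a, m) = (2, 3).

```python
import numpy as np

# Stand-in for a gate gradient of shape (n_a, m) = (2, 3)
d_gate = np.arange(6.0).reshape(2, 3)

# Summing over the examples (axis=1) while keeping the axis gives shape (n_a, 1)
db = np.sum(d_gate, axis=1, keepdims=True)

# Without keepdims the result is a flat vector of shape (n_a,)
db_flat = np.sum(d_gate, axis=1)

print(db.shape, db_flat.shape)
print(db)
```

The bias was broadcast across the m columns in the forward pass, so its gradient is the sum over those columns, and keeping the axis makes the result match the bias shape (n_a, 1).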

      Finally, you will compute the derivative with respect to the previous hidden state, previous memory state, and input.

      $$da_{prev} = W_f^T d\Gamma_f^{\langle t \rangle} + W_u^T d\Gamma_u^{\langle t \rangle} + W_c^T d\tilde{c}^{\langle t \rangle} + W_o^T d\Gamma_o^{\langle t \rangle} \tag{15}$$

      Here, the weights in equation (15) are restricted to the first n_a columns, the ones that multiply $a_{prev}$ (i.e. $W_f = W_f[:, :n_a]$ etc.).

      $$dc_{prev} = dc_{next} * \Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1 - \tanh(c_{next})^2) * \Gamma_f^{\langle t \rangle} * da_{next} \tag{16}$$

      $$dx^{\langle t \rangle} = W_f^T d\Gamma_f^{\langle t \rangle} + W_u^T d\Gamma_u^{\langle t \rangle} + W_c^T d\tilde{c}^{\langle t \rangle} + W_o^T d\Gamma_o^{\langle t \rangle} \tag{17}$$

      where the weights in equation (17) are restricted to the columns from n_a to the end, the ones that multiply $x^{\langle t \rangle}$ (i.e. $W_f = W_f[:, n_a:]$ etc.).

      Exercise: Implement lstm_cell_backward by implementing equations (7)-(17) in the function below. Good luck! :)

      def lstm_cell_backward(da_next, dc_next, cache):
          """
          Implement the backward pass for the LSTM-cell (single time-step).

          Arguments:
          da_next -- Gradients of next hidden state, of shape (n_a, m)
          dc_next -- Gradients of next cell state, of shape (n_a, m)
          cache -- cache storing information from the forward pass

          Returns:
          gradients -- python dictionary containing:
                              dxt -- Gradient of input data at time-step t, of shape (n_x, m)
                              da_prev -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                              dc_prev -- Gradient w.r.t. the previous memory state, of shape (n_a, m)
                              dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                              dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                              dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                              dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                              dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                              dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                              dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
                              dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
          """
          # Retrieve information from "cache"
          (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache
          ### START CODE HERE ###
          # Retrieve dimensions from xt's and a_next's shape (≈2 lines)
          n_x, m = None
          n_a, m = None
          # Compute gate-related derivatives; their values can be found by looking carefully at equations (7) to (10) (≈4 lines)
          dot = None
          dcct = None
          dit = None
          dft = None
          # Compute parameters related derivatives. Use equations (11)-(14) (≈8 lines)
          dWf = None
          dWi = None
          dWc = None
          dWo = None
          dbf = None
          dbi = None
          dbc = None
          dbo = None
          # Compute derivatives w.r.t previous hidden state, previous memory state and input. Use equations (15)-(17). (≈3 lines)
          da_prev = None
          dc_prev = None
          dxt = None
          ### END CODE HERE ###
          # Save gradients in dictionary
          gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                      "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}
          return gradients
      xt = np.random.randn(3,10)
      a_prev = np.random.randn(5,10)
      c_prev = np.random.randn(5,10)
      Wf = np.random.randn(5, 5+3)
      bf = np.random.randn(5,1)
      Wi = np.random.randn(5, 5+3)
      bi = np.random.randn(5,1)
      Wo = np.random.randn(5, 5+3)
      bo = np.random.randn(5,1)
      Wc = np.random.randn(5, 5+3)
      bc = np.random.randn(5,1)
      Wy = np.random.randn(2,5)
      by = np.random.randn(2,1)
      parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}
      a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)
      da_next = np.random.randn(5,10)
      dc_next = np.random.randn(5,10)
      gradients = lstm_cell_backward(da_next, dc_next, cache)
      print("gradients[\"dxt\"][1][2] =", gradients["dxt"][1][2])
      print("gradients[\"dxt\"].shape =", gradients["dxt"].shape)
      print("gradients[\"da_prev\"][2][3] =", gradients["da_prev"][2][3])
      print("gradients[\"da_prev\"].shape =", gradients["da_prev"].shape)
      print("gradients[\"dc_prev\"][2][3] =", gradients["dc_prev"][2][3])
      print("gradients[\"dc_prev\"].shape =", gradients["dc_prev"].shape)
      print("gradients[\"dWf\"][3][1] =", gradients["dWf"][3][1])
      print("gradients[\"dWf\"].shape =", gradients["dWf"].shape)
      print("gradients[\"dWi\"][1][2] =", gradients["dWi"][1][2])
      print("gradients[\"dWi\"].shape =", gradients["dWi"].shape)
      print("gradients[\"dWc\"][3][1] =", gradients["dWc"][3][1])
      print("gradients[\"dWc\"].shape =", gradients["dWc"].shape)
      print("gradients[\"dWo\"][1][2] =", gradients["dWo"][1][2])
      print("gradients[\"dWo\"].shape =", gradients["dWo"].shape)
      print("gradients[\"dbf\"][4] =", gradients["dbf"][4])
      print("gradients[\"dbf\"].shape =", gradients["dbf"].shape)
      print("gradients[\"dbi\"][4] =", gradients["dbi"][4])
      print("gradients[\"dbi\"].shape =", gradients["dbi"].shape)
      print("gradients[\"dbc\"][4] =", gradients["dbc"][4])
      print("gradients[\"dbc\"].shape =", gradients["dbc"].shape)
      print("gradients[\"dbo\"][4] =", gradients["dbo"][4])
      print("gradients[\"dbo\"].shape =", gradients["dbo"].shape)
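Equations (7) to (17) can be sketched in code as follows. This is one possible implementation, not the course's reference solution: the function names are made up, the output layer is left out, and the cache layout is assumed to match lstm_cell_forward. Two gradient entries are checked against central finite differences on the made-up scalar loss sum(da_next * a_next) + sum(dc_next * c_next).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward_sketch(xt, a_prev, c_prev, p):
    concat = np.concatenate((a_prev, xt), axis=0)
    ft = sigmoid(p["Wf"] @ concat + p["bf"])
    it = sigmoid(p["Wi"] @ concat + p["bi"])
    cct = np.tanh(p["Wc"] @ concat + p["bc"])
    c_next = ft * c_prev + it * cct
    ot = sigmoid(p["Wo"] @ concat + p["bo"])
    a_next = ot * np.tanh(c_next)
    return a_next, c_next, (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, p)

def lstm_cell_backward_sketch(da_next, dc_next, cache):
    a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, p = cache
    n_a = a_next.shape[0]
    tanh_c = np.tanh(c_next)
    # Total gradient reaching c_next: direct path plus the path through a_next
    dc = dc_next + da_next * ot * (1.0 - tanh_c ** 2)
    # Gradients w.r.t. the gate pre-activations, equations (7)-(10)
    dot = da_next * tanh_c * ot * (1.0 - ot)
    dcct = dc * it * (1.0 - cct ** 2)
    dit = dc * cct * it * (1.0 - it)
    dft = dc * c_prev * ft * (1.0 - ft)
    # Parameter gradients, equations (11)-(14) and the bias sums
    concat = np.concatenate((a_prev, xt), axis=0)
    grads = {}
    for name, d in (("f", dft), ("i", dit), ("c", dcct), ("o", dot)):
        grads["dW" + name] = d @ concat.T
        grads["db" + name] = d.sum(axis=1, keepdims=True)
    # Gradient w.r.t. the stacked input, then split it: equations (15) and (17)
    dconcat = p["Wf"].T @ dft + p["Wi"].T @ dit + p["Wc"].T @ dcct + p["Wo"].T @ dot
    grads["da_prev"] = dconcat[:n_a, :]
    grads["dxt"] = dconcat[n_a:, :]
    grads["dc_prev"] = dc * ft   # equation (16)
    return grads

# Gradient check on made-up data
rng = np.random.default_rng(0)
n_x, n_a, m = 3, 5, 4
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
c_prev = rng.standard_normal((n_a, m))
p = {}
for gate in "fico":
    p["W" + gate] = rng.standard_normal((n_a, n_a + n_x))
    p["b" + gate] = rng.standard_normal((n_a, 1))
da_next = rng.standard_normal((n_a, m))
dc_next = rng.standard_normal((n_a, m))
a_next, c_next, cache = lstm_cell_forward_sketch(xt, a_prev, c_prev, p)
grads = lstm_cell_backward_sketch(da_next, dc_next, cache)

def loss(params, a_in):
    a_out, c_out, _ = lstm_cell_forward_sketch(xt, a_in, c_prev, params)
    return np.sum(da_next * a_out) + np.sum(dc_next * c_out)

eps = 1e-6
p_plus = {k: v.copy() for k, v in p.items()}
p_minus = {k: v.copy() for k, v in p.items()}
p_plus["Wf"][2, 3] += eps
p_minus["Wf"][2, 3] -= eps
num_dWf = (loss(p_plus, a_prev) - loss(p_minus, a_prev)) / (2 * eps)

a_plus = a_prev.copy()
a_minus = a_prev.copy()
a_plus[1, 0] += eps
a_minus[1, 0] -= eps
num_da = (loss(p, a_plus) - loss(p, a_minus)) / (2 * eps)
print(num_dWf, grads["dWf"][2, 3])
print(num_da, grads["da_prev"][1, 0])
```

Computing the gradient of the stacked input once and slicing it is equivalent to applying equations (15) and (17) with the column-restricted weight matrices.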

      3.3 Backward pass through the LSTM RNN

      This part is very similar to the rnn_backward function you implemented above. You will first create variables of the same dimension as your return variables. You will then iterate over all the time steps starting from the end and call the one-step function you implemented for LSTM at each iteration. You will then update the parameter gradients by summing them individually. Finally, return a dictionary with the new gradients.

      Instructions: Implement the lstm_backward function. Create a for loop starting from $T_x$ and going backward. For each step call lstm_cell_backward and update your old gradients by adding the new gradients to them. Note that dxt is not updated but is stored.

      def lstm_backward(da, caches):
          """
          Implement the backward pass for the RNN with LSTM-cell (over a whole sequence).

          Arguments:
          da -- Gradients w.r.t. the hidden states, numpy-array of shape (n_a, m, T_x)
          caches -- cache storing information from the forward pass (lstm_forward)

          Returns:
          gradients -- python dictionary containing:
                              dx -- Gradient of inputs, of shape (n_x, m, T_x)
                              da0 -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                              dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                              dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                              dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                              dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                              dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                              dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                              dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
                              dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
          """
          # Retrieve values from the first cache (t=1) of caches.
          (caches, x) = caches
          (a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]
          ### START CODE HERE ###
          # Retrieve dimensions from da's and x1's shapes (≈2 lines)
          n_a, m, T_x = None
          n_x, m = None
          # initialize the gradients with the right sizes (≈12 lines)
          dx = None
          da0 = None
          da_prevt = None
          dc_prevt = None
          dWf = None
          dWi = None
          dWc = None
          dWo = None
          dbf = None
          dbi = None
          dbc = None
          dbo = None
          # loop back over the whole sequence
          for t in reversed(range(None)):
              # Compute all gradients using lstm_cell_backward
              gradients = None
              # Store or add the gradient to the parameters' previous step's gradient
              dx[:,:,t] = None
              dWf = None
              dWi = None
              dWc = None
              dWo = None
              dbf = None
              dbi = None
              dbc = None
              dbo = None
          # Set the first activation's gradient to the backpropagated gradient da_prev.
          da0 = None
          ### END CODE HERE ###
          # Store the gradients in a python dictionary
          gradients = {"dx": dx, "da0": da0, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                      "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}
          return gradients
      x = np.random.randn(3,10,7)
      a0 = np.random.randn(5,10)
      Wf = np.random.randn(5, 5+3)
      bf = np.random.randn(5,1)
      Wi = np.random.randn(5, 5+3)
      bi = np.random.randn(5,1)
      Wo = np.random.randn(5, 5+3)
      bo = np.random.randn(5,1)
      Wc = np.random.randn(5, 5+3)
      bc = np.random.randn(5,1)
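      # Wy and by are not defined in this snippet; the shapes below assume n_y = 2
      Wy = np.random.randn(2, 5)
      by = np.random.randn(2, 1)
      parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}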
      a, y, c, caches = lstm_forward(x, a0, parameters)
      da = np.random.randn(5, 10, 7)  # last dimension matches T_x = 7 of x above
      gradients = lstm_backward(da, caches)
      print("gradients[\"dx\"][1][2] =", gradients["dx"][1][2])
      print("gradients[\"dx\"].shape =", gradients["dx"].shape)
      print("gradients[\"da0\"][2][3] =", gradients["da0"][2][3])
      print("gradients[\"da0\"].shape =", gradients["da0"].shape)
      print("gradients[\"dWf\"][3][1] =", gradients["dWf"][3][1])
      print("gradients[\"dWf\"].shape =", gradients["dWf"].shape)
      print("gradients[\"dWi\"][1][2] =", gradients["dWi"][1][2])
      print("gradients[\"dWi\"].shape =", gradients["dWi"].shape)
      print("gradients[\"dWc\"][3][1] =", gradients["dWc"][3][1])
      print("gradients[\"dWc\"].shape =", gradients["dWc"].shape)
      print("gradients[\"dWo\"][1][2] =", gradients["dWo"][1][2])
      print("gradients[\"dWo\"].shape =", gradients["dWo"].shape)
      print("gradients[\"dbf\"][4] =", gradients["dbf"][4])
      print("gradients[\"dbf\"].shape =", gradients["dbf"].shape)
      print("gradients[\"dbi\"][4] =", gradients["dbi"][4])
      print("gradients[\"dbi\"].shape =", gradients["dbi"].shape)
      print("gradients[\"dbc\"][4] =", gradients["dbc"][4])
      print("gradients[\"dbc\"].shape =", gradients["dbc"].shape)
      print("gradients[\"dbo\"][4] =", gradients["dbo"][4])
      print("gradients[\"dbo\"].shape =", gradients["dbo"].shape)
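The shape printouts above can also be checked programmatically. The following is a minimal sketch under the shapes listed in the lstm_backward docstring; `expected_gradient_shapes` and `mismatched_gradients` are hypothetical helper names, and the gradients dictionary here is a zero-filled placeholder rather than real output of lstm_backward.

```python
import numpy as np

def expected_gradient_shapes(n_x, n_a, m, T_x):
    """Shapes documented in the lstm_backward docstring, keyed by gradient name."""
    shapes = {"dx": (n_x, m, T_x), "da0": (n_a, m)}
    for gate in "fico":
        shapes["dW" + gate] = (n_a, n_a + n_x)
        shapes["db" + gate] = (n_a, 1)
    return shapes

def mismatched_gradients(gradients, n_x, n_a, m, T_x):
    """Return the names whose entry is missing or has an unexpected shape."""
    expected = expected_gradient_shapes(n_x, n_a, m, T_x)
    return [name for name, shape in expected.items()
            if name not in gradients or np.shape(gradients[name]) != shape]

# Placeholder gradients (zeros) standing in for the output of lstm_backward
n_x, n_a, m, T_x = 3, 5, 10, 7
fake = {name: np.zeros(shape)
        for name, shape in expected_gradient_shapes(n_x, n_a, m, T_x).items()}
print(mismatched_gradients(fake, n_x, n_a, m, T_x))

# A bias gradient with the wrong rank is reported by name
fake["dbo"] = np.zeros((n_a,))
print(mismatched_gradients(fake, n_x, n_a, m, T_x))
```

An empty list means every gradient has the documented shape; T_x should be taken from the last dimension of the da array actually passed in.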

      Congratulations!

      Congratulations on completing this assignment. You now understand how recurrent neural networks work!

      Let's go on to the next exercise, where you'll use an RNN to build a character-level language model.





