1. RNN: building an RNN cell with numpy

0. Building a simple RNN with numpy

import numpy as np
timesteps = 10
input_features = 4
output_features = 8

inputs = np.random.random((timesteps, input_features))
print('inputs shape is ', inputs.shape)
# print(inputs)

state_t = np.zeros((output_features,))
print('state_t shape is ', state_t.shape)
# print(state_t)
W = np.random.random((output_features, input_features))
print('W shape is ', W.shape)
U = np.random.random((output_features, output_features))
print('U shape is ', U.shape)
b = np.random.random((output_features, ))
print('b shape is ', b.shape)
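successive_outputs = []
for input_t in inputs:
    # combine the current input with the previous state to get the current output
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    # store the output so the full sequence can be stacked afterwards
    successive_outputs.append(output_t)
    # the current output becomes the state for the next time step
    state_t = output_t

# the stacked result has shape (timesteps, output_features)
final_output_sequence = np.stack(successive_outputs, axis=0)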



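Each time step's input produces one time step's output, and each box is one RNN cell. The hidden state of the previous time step ($t-1$) influences the next time step ($t$). We write the hidden state as $a^{\langle t \rangle}$. Each RNN cell takes an input $x^{\langle t \rangle}$ and produces an output $y^{\langle t \rangle}$ and an output state $a^{\langle t \rangle}$; the output state of one time step is the input state of the next.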



2.1 RNN cell


  1. $x^{\langle t \rangle}$: the current input
  2. $a^{\langle t-1 \rangle}$: the hidden state of the previous cell, which carries information about the past
  3. $a^{\langle t \rangle}$: the output state, which is the input state of the next RNN cell
  4. $y^{\langle t \rangle}$: the prediction



  1. Compute the hidden state with the tanh (hyperbolic tangent) activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$
  2. Use the hidden state $a^{\langle t \rangle}$ from the previous step to compute the prediction $\hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$. The activation here is softmax.
  3. Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in cache
  4. Return $a^{\langle t \rangle}$ and $y^{\langle t \rangle}$, along with the cache

We vectorize over $m$ examples, so $x^{\langle t \rangle}$ is a matrix of shape $(n_x, m)$ and $a^{\langle t \rangle}$ is a matrix of shape $(n_a, m)$.
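The vectorized shapes above rely on NumPy broadcasting: a bias of shape (n_a, 1) is added to each of the m columns. The following is a small sketch of that idea, not part of the original code; the sizes n_x = 3, n_a = 5, m = 10 and the random seed are chosen purely for illustration.

```python
import numpy as np

n_x, n_a, m = 3, 5, 10                   # illustrative sizes (assumption)
rng = np.random.default_rng(0)

xt = rng.standard_normal((n_x, m))       # one column per example
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))       # broadcast across the m columns

# Vectorized hidden-state update for all m examples at once
a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)

# The same update for example 0 alone, to check that both agree
a_next_0 = np.tanh(Waa @ a_prev[:, 0] + Wax @ xt[:, 0] + ba[:, 0])

print(a_next.shape)                          # (5, 10)
print(np.allclose(a_next[:, 0], a_next_0))   # True
```

Each column of a_next is the hidden state of one example, so processing a batch costs one matrix product instead of m separate ones.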

def softmax(x):
    # softmax is used below but not defined in the original; this column-wise version is an assumption
    # subtracting the column max keeps np.exp from overflowing
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described in Figure (2)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m).
    parameters -- python dictionary containing:
                            Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                            Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                            Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                            ba --  Bias, numpy array of shape (n_a, 1)
                            by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    # compute next activation state using the formula given above
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya, a_next) + by)
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_pred, cache
xt = np.random.randn(3,10)
print('xt shape ', xt.shape)
a_prev = np.random.randn(5,10)
print('a_prev shape ', a_prev.shape)
Waa = np.random.randn(5,5)
print('Waa shape ', Waa.shape)
Wax = np.random.randn(5,3)
print('Wax shape ', Wax.shape)
Wya = np.random.randn(2,5)
print('Wya shape ', Wya.shape)
ba = np.random.randn(5,1)
print('ba shape ', ba.shape)
by = np.random.randn(2,1)
print('by shape ', by.shape)

parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}
# print(parameters)

a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)

print('a_next shape = ', a_next.shape)
print('yt_pred shape = ', yt_pred.shape)

print("a_next[4] = ", a_next[4])
print("a_next.shape = ", a_next.shape)
print("yt_pred[1] =", yt_pred[1])
print("yt_pred.shape = ", yt_pred.shape)

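2.2 RNN forward propagation


A three-dimensional array can be pictured as a cube. Take a 3x3x3 array placed inside a cube: x[0] selects index 0 along the first axis, which is the top horizontal layer of the cube, and x[0][0] selects index 0 along the second axis within that layer, which leaves a vector of length 3.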

import numpy as np
x = np.arange(27)
x = np.reshape(x, (3,3,3))
print('(rows, columns, channels)', x.shape)
print('row 0', x[0])

print(x[:, :, 0]) # the slice at depth 0 along the last axis
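The same kind of indexing is what pulls one time step out of a batched sequence: for an input of shape (n_x, m, T_x), x[:, :, t] is the (n_x, m) matrix for step t. A small sketch with made-up sizes (the sizes are assumptions for illustration only):

```python
import numpy as np

n_x, m, T_x = 3, 2, 4                       # illustrative sizes (assumption)
x = np.arange(n_x * m * T_x).reshape(n_x, m, T_x)

x_t = x[:, :, 1]                            # all features, all examples, time step 1
print(x_t.shape)                            # (3, 2)

# feature 0 of example 0 across all time steps
print(x[0, 0, :])                           # [0 1 2 3]
```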

As the figure below shows, an RNN is built by applying a single RNN cell in a loop. If the input sequence has 10 time steps, the RNN cell is copied 10 times. Each cell takes the hidden state output of the previous cell ($a^{\langle t-1 \rangle}$) as its hidden state input and $x^{\langle t \rangle}$ as its current input. It outputs the hidden state $a^{\langle t \rangle}$ and the prediction $y^{\langle t \rangle}$.

  1. Input sequence: $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots, x^{\langle T_x \rangle})$
  2. Output: $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \ldots, y^{\langle T_x \rangle})$

Exercise: implement the forward propagation of the RNN described in the figure above.


  1. Create an all-zeros array a to store the hidden states computed by the RNN
  2. Initialize the "next" hidden state to $a_0$
  3. Loop over the time steps, with t as the step index
    • Update the "next" hidden state and the cache by calling rnn_cell_forward
    • Store the "next" hidden state in a (at the $t$-th position)
    • Store the prediction in y
    • Append the cache to caches
  4. Return a, y and caches
def rnn_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network described in Figure (3).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of caches, x)
    """
    # Initialize "caches" which will contain the list of all caches
    caches = []
    # Retrieve dimensions from shapes of x and parameters["Wya"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape
    # initialize "a" and "y" with zeros (2 lines)
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    # Initialize a_next (1 line)
    a_next = a0
    # loop over all time-step:
    for t in range(T_x):
        # Update next hidden state, compute the prediction, get the cache (1 line)
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)
        # save the value of the new "next " hidden state in a 
        a[:, :, t] = a_next
        # Save the value of the prediction in y (1 line)
        y_pred[:, :, t] = yt_pred
        # append "cache" to "caches"
        caches.append(cache)
    # store values needed for backward propagation in cache
    caches = (caches, x)
    return a, y_pred, caches
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a, y_pred, caches = rnn_forward(x, a0, parameters)

print("a[4][1] = ", a[4][1])
print("a.shape = ", a.shape)
print("y_pred[1][3] =", y_pred[1][3])
print("y_pred.shape = ", y_pred.shape)
print("caches[1][1][3] =", caches[1][1][3])
print("len(caches) = ", len(caches))


3.1 What is an LSTM


  • $a^{\langle t-1 \rangle}$: short-term memory
  • $c^{\langle t-1 \rangle}$: long-term memory

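3.2 Forget gate

To illustrate, suppose we are reading words from a piece of text and want an LSTM to track grammatical structure, such as whether the subject is singular or plural. If the subject changes from singular to plural, we need a way to forget the stored memory of the subject's number. In an LSTM, the forget gate does this: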

Forget weights: $\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)$

Forget gate: $\Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle}$

Here $W_f$ is the weight matrix that determines how much the forget gate forgets. We stack $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ into one matrix and multiply it by $W_f$. The result $\Gamma_f^{\langle t \rangle}$ is a vector whose values lie between 0 and 1. This vector is then multiplied element-wise with the previous LSTM cell state $c^{\langle t-1 \rangle}$ (the long-term memory). If an element of $\Gamma_f^{\langle t \rangle}$ is 0, the corresponding memory in $c^{\langle t-1 \rangle}$ is forgotten; if it is 1, that memory is kept.
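A small numeric sketch of the forget gate may help. The pre-activation values below are hypothetical stand-ins for $W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f$, chosen so that the sigmoid lands near 0, at 0.5, and near 1.

```python
import numpy as np

def sigmoid(z):
    # logistic function, maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical pre-activations for a cell with 3 units (assumption)
z_f = np.array([-20.0, 0.0, 20.0])
gamma_f = sigmoid(z_f)                  # roughly [0, 0.5, 1]

c_prev = np.array([4.0, 4.0, 4.0])      # previous long-term memory
kept = gamma_f * c_prev                 # element-wise product, roughly [0, 2, 4]

print(np.round(gamma_f, 3))
print(np.round(kept, 3))
```

The first unit is almost entirely forgotten, the second is halved, and the third is kept nearly unchanged.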

3.3 Update gate


$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$

Like the forget gate, $\Gamma_u^{\langle t \rangle}$ is a vector of values between 0 and 1. To compute $c^{\langle t \rangle}$, it is multiplied element-wise with $\tilde{c}^{\langle t \rangle}$.


To update the subject's attribute, we combine the short-term memory of the previous LSTM cell ($a^{\langle t-1 \rangle}$) with the current input and compute the newly learned content, the candidate value:

$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$

The new cell state keeps part of the old memory and adds part of the candidate:

$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle}$
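The cell state update can be traced by hand on a tiny example. The gate and candidate values below are made up for illustration; in a real cell they would come from the sigmoid and tanh expressions above.

```python
import numpy as np

c_prev  = np.array([2.0, -1.0, 0.5])    # previous cell state
gamma_f = np.array([1.0,  0.0, 0.5])    # forget gate: keep, drop, keep half
gamma_u = np.array([0.0,  1.0, 0.5])    # update gate: ignore, accept, accept half
c_tilde = np.array([0.9,  0.9, -0.5])   # candidate values, inside (-1, 1) because of tanh

# element-wise mix of old memory and new candidate
c_next = gamma_f * c_prev + gamma_u * c_tilde
print(c_next)
```

Unit 0 keeps its old value 2.0, unit 1 is overwritten by the candidate 0.9, and unit 2 blends half of each (0.25 - 0.25 = 0).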

3.4 Output gate

To obtain the short-term memory passed to the next LSTM cell ($a^{\langle t \rangle}$, which that cell receives as its $a^{\langle t-1 \rangle}$), we use the following formulas:


$\Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)$


$a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle})$
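One consequence of this formula is that the hidden state always stays inside (-1, 1), even when the cell state grows larger. A brief sketch with made-up values:

```python
import numpy as np

c_next  = np.array([3.0, -3.0, 0.0])    # cell state, not bounded (assumed values)
gamma_o = np.array([1.0,  0.5, 1.0])    # output gate values in [0, 1] (assumed values)

# the output gate scales the squashed cell state
a_next = gamma_o * np.tanh(c_next)

print(np.round(a_next, 3))
print(np.all(np.abs(a_next) < 1))       # True
```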

3.5 LSTM cell


  1. Concatenate $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ into a single matrix: concat =
    $\begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$
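Stacking the two inputs means one weight matrix of shape (n_a, n_a + n_x) does the work of two separate matrices. The sketch below checks this on random data; np.concatenate is used here as one way to build the stacked matrix, and the sizes and seed are assumptions for illustration.

```python
import numpy as np

n_a, n_x, m = 5, 3, 10                          # illustrative sizes (assumption)
rng = np.random.default_rng(1)
a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))
Wf = rng.standard_normal((n_a, n_a + n_x))

# rows [0, n_a) hold a_prev, rows [n_a, n_a + n_x) hold xt
concat = np.concatenate([a_prev, xt], axis=0)

# split Wf into the block acting on a_prev and the block acting on xt
Wf_a, Wf_x = Wf[:, :n_a], Wf[:, n_a:]

print(concat.shape)                                            # (8, 10)
print(np.allclose(Wf @ concat, Wf_a @ a_prev + Wf_x @ xt))     # True
```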
def sigmoid(x):
    # logistic function used by the three gates; not defined in the original, so this helper is an assumption
    return 1 / (1 + np.exp(-x))

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implements a single forward step of the LSTM cell.

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo --  Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the memory value
    """
    # Retrieve the weights and biases from "parameters"
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
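    # Retrieve dimensions from the shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape
    # Concatenate a_prev and xt: rows [0, n_a) hold a_prev, the remaining n_x rows hold xt
    concat = np.zeros((n_x + n_a, m))
    concat[: n_a, :] = a_prev
    concat[n_a :, :] = xt
    # Compute ft, it, cct, c_next, ot and a_next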
    ft = sigmoid(np.dot(Wf, concat) + bf)
    # print("ft shape = ", ft.shape)
    it = sigmoid(np.dot(Wi, concat) + bi)
    # print("it shape = ", it.shape)
    cct = np.tanh(np.dot(Wc, concat) + bc)
    # print("cct shape = ", cct.shape)
    c_next = ft * c_prev + it * cct
    # print("c_next shape = ", c_next.shape)
    ot = sigmoid(np.dot(Wo, concat) + bo)
    # print("ot shape = ", ot.shape)
    a_next = ot * np.tanh(c_next)
    # print("a_next shape = ", a_next.shape)
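    # Compute the prediction of the LSTM cell (softmax is the same helper used by rnn_cell_forward)
    yt_pred = softmax(np.dot(Wy, a_next) + by)
    # print("yt_pred shape = ", yt_pred.shape)
    # Store the values needed for backward propagation in cache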
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)
    return a_next, c_next, yt_pred, cache

3.6 LSTM forward propagation

$c^{\langle 0 \rangle}$ is initialized to all zeros.

def lstm_forward(x, a0, parameters):
    """
    Implements the forward propagation of the LSTM over all time steps.

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    c -- Memory states for every time-step, numpy array of shape (n_a, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
    """
    # Initialize "caches", which will hold the cache of every time step
    caches = []
    # Retrieve dimensions from the shapes of x and parameters['Wy']
    n_x, m, T_x = x.shape
    n_y, n_a = parameters['Wy'].shape
    # Initialize a, c and y with zeros
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    # Initialize a_next and c_next
    a_next = a0
    c_next = np.zeros((n_a, m))
    # Loop over all time steps
    for t in range(T_x):
        # Update next hidden state, next memory state, compute the prediction, get the cache
        a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)
        # Save the value of the new "next" hidden state in a
        a[:, :, t] = a_next
        # Save the value of the prediction in y
        y[:, :, t] = yt
        # Save the value of the next cell state 
        c[:, :, t] = c_next
        # Append the cache into caches
        caches.append(cache)
    # Store the values needed for backward propagation
    caches = (caches, x)
    return a, y, c, caches


