




Consider a many-to-many RNN. The input is $x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \cdots, x^{\langle T_x \rangle}$ and the desired output is $y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \cdots, y^{\langle T_x \rangle}$, where $x^{\langle i \rangle}$ and $y^{\langle i \rangle}$ are vectors of arbitrary dimension.
An RNN works by repeatedly updating a hidden state (the activation $a^{\langle i \rangle}$), which can also have any dimension. At any time step $t$:

  1. The next hidden state $a^{\langle t \rangle}$ is computed from the previous hidden state $a^{\langle t-1 \rangle}$ and the current input $x^{\langle t \rangle}$.
  2. The output $y^{\langle t \rangle}$ (the prediction) is computed from $a^{\langle t \rangle}$.



In general the notation reads as follows: $a^{(2)[3]\langle 4 \rangle}_5$ denotes the 5th element of the activation vector for the 2nd training example (2), in the 3rd layer [3], at time step $t=4$ <4>.


For a single example at a single time step, $x^{(i)\langle t \rangle}$ is a one-dimensional vector. In NLP, for instance, with a vocabulary of 5000 words each word has a one-hot encoding of shape $(5000,)$, so $x^{(i)\langle t \rangle}$ also has shape $(5000,)$.

We write $n_x$ for the number of units of a single example at a single time step; in this example $n_x = 5000$.

We index time steps with $t$ and write $T_x$ for the number of time steps in the input; assume $T_x = 10$ in our example.

With mini-batches of 20 examples each (the batch size is denoted $m$), we vectorize by stacking the 20 examples as columns, which gives a tensor of shape $(5000, 20, 10)$.

So a mini-batch has shape $(n_x, m, T_x)$.

At each time step, $x^{\langle t \rangle}$ has shape $(n_x, m)$.
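
To make these shapes concrete, here is a minimal NumPy sketch. The sizes come from the example above; the random word indices and the variable names (`word_ids`, `X`, `x_t`) are assumptions made purely for illustration.

```python
import numpy as np

n_x, m, T_x = 5000, 20, 10  # vocabulary size, batch size, sequence length

rng = np.random.default_rng(0)
# Hypothetical data: one random word index per example and time step.
word_ids = rng.integers(0, n_x, size=(m, T_x))

# Build the one-hot mini-batch tensor of shape (n_x, m, T_x).
X = np.zeros((n_x, m, T_x))
sample_idx, step_idx = np.meshgrid(np.arange(m), np.arange(T_x), indexing="ij")
X[word_ids, sample_idx, step_idx] = 1

x_t = X[:, :, 3]  # the input for all examples at one time step (here t = 3)
print(X.shape)    # (5000, 20, 10)
print(x_t.shape)  # (5000, 20)
```

Slicing along the last axis is what lets the RNN process one time step for the whole batch at once.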


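The activation $a^{\langle t \rangle}$ passed from one time step to the next is called the hidden state.

The length of the hidden state for a single training example is denoted $n_a$.


$\hat y$ is likewise a 3-dimensional tensor, of shape $(n_y, m, T_y)$:

  • $n_y$: number of units in the prediction vector
  • $m$: batch size
  • $T_y$: number of time steps in the prediction

Likewise, at a single time step the prediction $\hat{y}^{\langle t \rangle}$ has shape $(n_y, m)$.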



$$a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a) \tag{1}$$

$$\begin{aligned} z^{\langle t \rangle} &= W_{ya} a^{\langle t \rangle} + b_y \\ \hat{y}^{\langle t \rangle} &= \mathrm{softmax}(z^{\langle t \rangle}) \end{aligned} \tag{2}$$

Here $W_{aa}$ is the weight matrix that maps $a^{\langle t-1 \rangle}$ to $a^{\langle t \rangle}$, $W_{ax}$ is the weight matrix that maps $x$ to $a$, and $b_a$ is the bias used when computing $a$.
Applying the $\tanh$ activation gives the new activation. The formula shows that the new activation depends on both the current input and the activation from the previous time step.

Once we have the new activation we apply one more linear map and pass the result through $\mathrm{softmax}$ (for multi-class classification) to get the output. The weights involved are $W_{ya}$, which maps the activation $a$ to the output $y$, and the corresponding bias $b_y$.

The above computes the output at a single time step $t$. For the whole input sequence the computation is as shown in the figure below:
The initial activation is usually $a^{\langle 0 \rangle} = 0$, a zero vector.
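
Equations (1) and (2) can be sketched as a single-step function that is applied repeatedly over the sequence. This is only an illustrative many-to-many version under assumed toy sizes; the helper name `rnn_step` and the random parameters are not from the original code.

```python
import numpy as np

def softmax(z):
    # Column-wise softmax; subtracting the column max keeps exp() stable.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One time step: equation (1), then equation (2)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)  # (n_a, m)
    y_hat_t = softmax(Wya @ a_t + by)             # (n_y, m)
    return a_t, y_hat_t

# Small, arbitrary sizes chosen only for this illustration.
n_x, n_a, n_y, m, T_x = 4, 3, 2, 5, 6
rng = np.random.default_rng(0)
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))
x = rng.standard_normal((n_x, m, T_x))

a = np.zeros((n_a, m))  # a<0> = 0
outputs = []
for t in range(T_x):
    a, y_hat = rnn_step(x[:, :, t], a, Waa, Wax, Wya, ba, by)
    outputs.append(y_hat)

y_hats = np.stack(outputs, axis=2)  # (n_y, m, T_x): one prediction per step
print(y_hats.shape)
```

The same weights are reused at every step; only the hidden state `a` carries information forward in time.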


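Returning to the sentiment analysis example of this article: it is a many-to-one structure. Our RNN reads the whole sentence and then decides which sentiment class it belongs to (there is only one output $\hat y$).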

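import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class RNN:
    def __init__(self, n_x, n_y, n_a=32):
        """
        n_x : size of the input vector x (vocabulary size)
        n_y : size of the output vector y (number of classes)
        n_a : number of hidden units
        """
        self.Wax = np.random.randn(n_a, n_x) / 10000
        self.Waa = np.random.randn(n_a, n_a) / 10000
        self.Wya = np.random.randn(n_y, n_a) / 10000
        # biases
        self.ba = np.random.randn(n_a, 1)
        self.by = np.random.randn(n_y, 1)

    def forward(self, x):
        """
        x : inputs for all time steps, shape (n_x, m, T_x)
        returns: predictions of shape (n_y, m)
        """
        n_x, m, T_x = x.shape
        n_y, n_a = self.Wya.shape
        a = np.zeros((n_a, m))  # initial activation a<0> = 0

        # keep the input x for backpropagation
        self.x = x
        # keep the activation of every time step; a_his[0] is a<0>
        self.a_his = [a]

        for t in range(T_x):
            # input at time step t
            xt = x[:, :, t]
            a = np.tanh(self.Wax.dot(xt) + self.Waa.dot(a) + self.ba)  # ba is broadcast to (n_a, m)
            # backward() reads a_his[t + 1], so every activation must be stored
            self.a_his.append(a)

        # many-to-one: the output is computed only after the whole sequence
        # has been read, i.e. at the last time step
        y_pred = softmax(self.Wya.dot(a) + self.by)

        return y_pred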


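To run backpropagation we need a loss function. Since there may be several output classes, we use the cross-entropy loss. Here $\hat y$ is the prediction, obtained by applying a linear map to the last activation $a^{\langle T_x \rangle}$ followed by softmax, and $y$ is the ground truth.

$$L = -\sum_c y_c \log \hat y_c = -\sum_c y_c \log(\mathrm{softmax}(z_c)) \tag{3}$$

We need the derivative of $L$ with respect to $z$, written dz. For the full derivation see the blog post "Derivatives of Softmax and Cross-entropy"; the result is:
$$\hat y - y \tag{4}$$
Because $y$ is one-hot, for the true class $i$ this becomes:
$$\hat y_i - 1_i \tag{5}$$
In other words, subtract $1$ from the entry of the prediction that corresponds to the true class, and leave the entries where $y$ is zero unchanged.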
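
Result (4) can be sanity-checked numerically. The sketch below compares $\hat y - y$ with a central finite-difference estimate of the gradient of loss (3) for a single sample; the logits and the label are arbitrary values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # Equation (3) for a single sample.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])  # arbitrary logits
y = np.array([0.0, 0.0, 1.0])   # one-hot label: the true class is index 2

analytic = softmax(z) - y       # equation (4)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```

Only the entry of the true class differs from $\hat y$, which is what equation (5) states.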

$$\begin{aligned} z^{\langle T_x \rangle} &= W_{ya} a^{\langle T_x \rangle} + b_y \\ \hat y^{\langle T_x \rangle} &= \mathrm{softmax}(z^{\langle T_x \rangle}) \end{aligned}$$

Next we compute the gradients with respect to $W_{ya}$ and $b_y$. Here we only need to consider the path from the last activation to the RNN's output.

For $W_{ya}$ we have:

$$\frac{\partial L}{\partial W_{ya}} = \frac{\partial L}{\partial z^{\langle T_x \rangle}} \cdot \frac{\partial z^{\langle T_x \rangle}}{\partial W_{ya}} \tag{6}$$

where $a^{\langle T_x \rangle}$ is the activation of the last time step. This gives:
$$\frac{\partial z^{\langle T_x \rangle}}{\partial W_{ya}} = a^{\langle T_x \rangle} \tag{7}$$
$$\frac{\partial L}{\partial W_{ya}} = \boxed{ \frac{\partial L}{\partial z^{\langle T_x \rangle}} a^{\langle T_x \rangle}} \tag{8}$$

Similarly, for $b_y$:
$$\frac{\partial z^{\langle T_x \rangle}}{\partial b_y} = 1 \tag{9}$$
$$\frac{\partial L}{\partial b_y} = \boxed {\frac{\partial L}{\partial z^{\langle T_x \rangle}}} \tag{10}$$
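
In vectorized form, equations (8) and (10) turn into a matrix product and a sum over the batch. The following is a small sketch under assumed toy sizes; `a_last` and `dz` are random stand-ins for $a^{\langle T_x \rangle}$ and $\partial L / \partial z^{\langle T_x \rangle}$, not values from a real network.

```python
import numpy as np

n_a, n_y, m = 3, 2, 4  # illustrative sizes
rng = np.random.default_rng(1)
a_last = np.tanh(rng.standard_normal((n_a, m)))  # stands in for a<T_x>
dz = rng.standard_normal((n_y, m))               # stands in for y_hat - y

# Equation (8): outer product of dz and a<T_x>, summed over the batch.
dWya = dz @ a_last.T                 # (n_y, n_a)
# Equation (10): dL/db_y is dz itself, summed over the batch.
dby = dz.sum(axis=1, keepdims=True)  # (n_y, 1)

# The same gradient written as an explicit sum over examples.
dWya_loop = sum(np.outer(dz[:, i], a_last[:, i]) for i in range(m))
print(np.allclose(dWya, dWya_loop))
```

The transpose on `a_last` is what makes the result match the shape of $W_{ya}$.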

Finally, we need the gradients of $W_{aa}$, $W_{ax}$ and $b_a$, which are used at every time step of the RNN. We have:

$$\frac{\partial L}{\partial W_{ax}} = \frac{\partial L}{\partial z^{\langle T_x \rangle}} \sum_{t=1}^{T_x} \frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t \rangle}} \cdot \frac{\partial a^{\langle t \rangle}}{\partial W_{ax}}$$
This is because changing $W_{ax}$ affects every $a^{\langle t \rangle}$, which in turn affects $z^{\langle T_x \rangle}$ and finally $L$. Forward propagation runs from left to right with increasing time steps. Backpropagation has to account for all time steps and runs with decreasing time steps, which is why it is called BPTT (Backpropagation Through Time).


At a given time step $t$ we need $\frac{\partial a^{\langle t \rangle}}{\partial W_{ax}}$:

$$a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$$
We have derived the derivative of $\tanh$ before:
$$\frac{\partial \tanh(x)}{\partial x} = 1 - \tanh^2(x)$$
$$\frac{\partial a^{\langle t \rangle}}{\partial W_{ax}} = \boxed{(1-{a^{\langle t \rangle}}^2)x^{\langle t \rangle}} \tag{11}$$
$$\frac{\partial a^{\langle t \rangle}}{\partial W_{aa}} = \boxed{(1-{a^{\langle t \rangle}}^2)a^{\langle t-1 \rangle}} \tag{12}$$
$$\frac{\partial a^{\langle t \rangle}}{\partial b_a} = \boxed{(1-{a^{\langle t \rangle}}^2)} \tag{13}$$
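
The factor $(1 - {a^{\langle t \rangle}}^2)$ shared by equations (11) to (13) is just the $\tanh$ derivative expressed through the activation itself. A quick numerical check, with arbitrarily chosen sample points:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 13)  # arbitrary pre-activation values
a = np.tanh(x)

analytic = 1 - a ** 2  # the (1 - a^2) factor in equations (11)-(13)

eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```

Because the derivative only needs `a`, the forward pass can store the activations and the backward pass never has to recompute the pre-activation values.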

All that remains is $\frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t \rangle}}$, which we can compute recursively. Since $a^{\langle t+1 \rangle} = \tanh(W_{aa} a^{\langle t \rangle} + \cdots)$, the $\tanh$ derivative is evaluated at $a^{\langle t+1 \rangle}$:
$$\begin{aligned} \frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t \rangle}} &= \frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t+1 \rangle}} \cdot \frac{\partial a^{\langle t+1 \rangle}}{\partial a^{\langle t \rangle}} \\ &= \frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t+1 \rangle}} (1 - {a^{\langle t+1 \rangle}}^2) W_{aa} \end{aligned} \tag{14}$$

As the figure above shows, in the many-to-many structure the only successor node at time step $T_x$ is $\hat y^{\langle T_x \rangle}$, and the gradient is given by formula $(16)$ below. At a time step $t\,\,(1 \leq t < T_x)$ there are two successor nodes: the output of the current step $z^{\langle t \rangle}$ and the activation of the next step $a^{\langle t+1 \rangle}$. Both cases must be taken into account when computing gradients.

When implementing BPTT we start from the last activation and propagate backwards. So by the time we need $\frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t \rangle}}$ we have already computed $\frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle t+1 \rangle}}$. The only exception is the last activation, for which:

$$\frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle T_x \rangle}} = W_{ya} \tag{15}$$
$$\frac{\partial L}{\partial a^{\langle T_x \rangle}} = \frac{\partial L}{\partial z^{\langle T_x \rangle}} \cdot \frac{\partial z^{\langle T_x \rangle}}{\partial a^{\langle T_x \rangle}} = \boxed{ \frac{\partial L}{\partial z^{\langle T_x \rangle}} W_{ya}} \tag{16}$$
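
The whole recursion can be verified with a finite-difference gradient check on a tiny many-to-one RNN. This is a minimal sketch, not the article's class: the function names (`forward`, `loss`, `bptt`), the toy sizes and the random data are all assumptions made for the check. In matrix form the recursion of equation (14) is written with transposed weights, `Waa.T @ temp`, and equation (16) as `Wya.T @ dz`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(params, x):
    Wax, Waa, Wya, ba, by = params
    a = np.zeros((Waa.shape[0], x.shape[1]))
    a_his = [a]                       # a_his[0] is a<0>
    for t in range(x.shape[2]):
        a = np.tanh(Wax @ x[:, :, t] + Waa @ a + ba)
        a_his.append(a)
    return softmax(Wya @ a + by), a_his

def loss(params, x, y):
    y_hat, _ = forward(params, x)
    return -np.sum(y * np.log(y_hat))  # cross-entropy summed over the batch

def bptt(params, x, y):
    Wax, Waa, Wya, ba, by = params
    y_hat, a_his = forward(params, x)
    T_x = x.shape[2]
    dz = y_hat - y                     # equation (4)
    dWya = dz @ a_his[T_x].T           # equation (8)
    dby = dz.sum(axis=1, keepdims=True)
    dWax, dWaa, dba = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(ba)
    da = Wya.T @ dz                    # equation (16)
    for t in reversed(range(T_x)):
        temp = da * (1 - a_his[t + 1] ** 2)
        dba += temp.sum(axis=1, keepdims=True)
        dWaa += temp @ a_his[t].T
        dWax += temp @ x[:, :, t].T
        da = Waa.T @ temp              # equation (14), one step back in time
    return [dWax, dWaa, dWya, dba, dby]

rng = np.random.default_rng(0)
n_x, n_a, n_y, m, T_x = 4, 3, 2, 5, 6
shapes = [(n_a, n_x), (n_a, n_a), (n_y, n_a), (n_a, 1), (n_y, 1)]
params = [rng.standard_normal(s) * 0.5 for s in shapes]
x = rng.standard_normal((n_x, m, T_x))
y = np.eye(n_y)[:, rng.integers(0, n_y, size=m)]  # random one-hot labels

grads = bptt(params, x, y)
eps = 1e-5
max_err = 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus = loss(params, x, y)
        p[idx] = old - eps
        loss_minus = loss(params, x, y)
        p[idx] = old
        max_err = max(max_err, abs((loss_plus - loss_minus) / (2 * eps) - g[idx]))
print(max_err)  # should be very small if the recursion is right
```

A check like this is a cheap way to catch index or transpose mistakes in a hand-written backward pass.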



    def backward(self, dz, learning_rate):
        """
        dz: gradient with respect to z, shape (n_y, m)
        learning_rate: learning rate
        """
        n_x, m, T_x = self.x.shape
        n_a = self.Wax.shape[0]

        # gradients of Wya and by
        dWya = dz.dot(self.a_his[T_x].T)
        dby = np.sum(dz, axis=1, keepdims=True)

        dWax = np.zeros((n_a, n_x))
        dWaa = np.zeros((n_a, n_a))
        dba = np.zeros((n_a, 1))

        # gradient of the last activation, formula (16)
        da = np.dot(self.Wya.T, dz)  # (n_a, m)

        # BPTT
        for t in reversed(range(T_x)):
            # da * (1 - a^2)
            temp = np.multiply(da, 1 - self.a_his[t + 1] ** 2)  # (n_a, m)
            # accumulate dba
            dba += np.sum(temp, axis=1, keepdims=True)
            dWaa += temp.dot(self.a_his[t].T)
            dWax += temp.dot(self.x[:, :, t].T)  # (n_a, n_x)
            # gradient of the previous activation, formula (14); in a
            # many-to-one RNN it is replaced, not accumulated, and the
            # matrix form needs Waa transposed
            da = np.dot(self.Waa.T, temp)

        # guard against exploding gradients
        for d in [dWax, dWaa, dWya, dba, dby]:
            np.clip(d, -1, 1, out=d)  # limit every gradient entry to [-1, 1]

        self.Waa -= learning_rate * dWaa
        self.Wax -= learning_rate * dWax
        self.Wya -= learning_rate * dWya
        self.ba -= learning_rate * dba
        self.by -= learning_rate * dby
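
The clipping step in the backward pass relies on the `out` argument of `np.clip` to modify each gradient array in place. A tiny sketch with made-up gradient values shows the effect; the norm-based variant at the end is a common alternative added here for comparison and is not what the code above does.

```python
import numpy as np

grad = np.array([[-3.5, 0.2], [0.9, 12.0]])  # made-up gradient values

# Element-wise clipping as in backward(): out=grad writes the result in place.
np.clip(grad, -1, 1, out=grad)
print(grad)  # every entry now lies in [-1, 1]

# Alternative (an assumption, not used above): rescale by the overall norm,
# which caps the step size but keeps the gradient direction.
g = np.array([[-3.5, 0.2], [0.9, 12.0]])
max_norm = 1.0
norm = np.linalg.norm(g)
if norm > max_norm:
    g = g * (max_norm / norm)
print(np.linalg.norm(g))
```

Element-wise clipping is simpler but can change the direction of the update, whereas norm clipping only shortens it.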


import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class RNN:
    def __init__(self, n_x, n_y, n_a=32):
        """
        n_x : size of the input vector x (vocabulary size)
        n_y : size of the output vector y (number of classes)
        n_a : number of hidden units
        """
        self.Wax = np.random.randn(n_a, n_x) / 10000
        self.Waa = np.random.randn(n_a, n_a) / 10000
        self.Wya = np.random.randn(n_y, n_a) / 10000
        # biases
        self.ba = np.random.randn(n_a, 1)
        self.by = np.random.randn(n_y, 1)

    def forward(self, x):
        """
        x : inputs for all time steps, shape (n_x, m, T_x)
        returns: predictions of shape (n_y, m)
        """
        n_x, m, T_x = x.shape
        n_y, n_a = self.Wya.shape
        a = np.zeros((n_a, m))  # initial activation a<0> = 0

        # keep the input x for backpropagation
        self.x = x
        # keep the activation of every time step; a_his[0] is a<0>
        self.a_his = [a]

        for t in range(T_x):
            # input at time step t
            xt = x[:, :, t]
            a = np.tanh(self.Wax.dot(xt) + self.Waa.dot(a) + self.ba)  # ba is broadcast to (n_a, m)
            # backward() reads a_his[t + 1], so every activation must be stored
            self.a_his.append(a)

        # many-to-one: the output is computed only after the whole sequence
        # has been read, i.e. at the last time step
        y_pred = softmax(self.Wya.dot(a) + self.by)

        return y_pred

    def backward(self, dz, learning_rate):
        """
        dz: gradient with respect to z, shape (n_y, m)
        learning_rate: learning rate
        """
        n_x, m, T_x = self.x.shape
        n_a = self.Wax.shape[0]

        # gradients of Wya and by
        dWya = dz.dot(self.a_his[T_x].T)
        dby = np.sum(dz, axis=1, keepdims=True)

        dWax = np.zeros((n_a, n_x))
        dWaa = np.zeros((n_a, n_a))
        dba = np.zeros((n_a, 1))

        # gradient of the last activation, formula (16)
        da = np.dot(self.Wya.T, dz)  # (n_a, m)

        # BPTT
        for t in reversed(range(T_x)):
            # da * (1 - a^2)
            temp = np.multiply(da, 1 - self.a_his[t + 1] ** 2)  # (n_a, m)
            # accumulate dba
            dba += np.sum(temp, axis=1, keepdims=True)
            dWaa += temp.dot(self.a_his[t].T)
            dWax += temp.dot(self.x[:, :, t].T)  # (n_a, n_x)
            # gradient of the previous activation, formula (14); in a
            # many-to-one RNN it is replaced, not accumulated, and the
            # matrix form needs Waa transposed
            da = np.dot(self.Waa.T, temp)

        # guard against exploding gradients
        for d in [dWax, dWaa, dWya, dba, dby]:
            np.clip(d, -1, 1, out=d)  # limit every gradient entry to [-1, 1]

        self.Waa -= learning_rate * dWaa
        self.Wax -= learning_rate * dWax
        self.Wya -= learning_rate * dWya
        self.ba -= learning_rate * dba
        self.by -= learning_rate * dby

    def fit(self, X_train, Y_train, epochs=30, mini_batch_size=20, learning_rate=2e-2, print_cost=False, X_test=None,
            Y_test=None):
        """
        param X_train:  input data of size (n_x, m,T_x)
        param Y_train:  labels of shape (n_y,m)
        """
        m = X_train.shape[1]

        for i in range(epochs):
            # shuffle the examples, then cut them into mini-batches
            indexes = np.random.permutation(m)

            X_mini_batches = [
                X_train[:, indexes, :][:, k:k + mini_batch_size, :] for k in range(0, m, mini_batch_size)
            ]
            y_mini_batches = [
                Y_train[:, indexes][:, k:k + mini_batch_size] for k in range(0, m, mini_batch_size)
            ]
            num_correct = 0
            for X_batch, y_batch in zip(X_mini_batches, y_mini_batches):
                y_pred = self.forward(X_batch)
                dz = y_pred - y_batch
                self.backward(dz, learning_rate)

                i_pred = np.argmax(y_pred, axis=0)
                i_batch = np.argmax(y_batch, axis=0)

                num_correct += np.sum(i_pred == i_batch)

            if print_cost and i % 100 == 99 and X_test is not None:
                print("Train accuracy : %.3f \t Test accuracy : %.3f after iteration %i" % (num_correct/m,self.evaluate(X_test,Y_test) , i+1))

    def evaluate(self, X_test, Y_test):
        y_pred = self.forward(X_test)
        i_pred = np.argmax(y_pred, axis=0)
        i_test = np.argmax(Y_test, axis=0)

        return np.sum(i_pred == i_test) / X_test.shape[1]


train_data = {
  'good': 'T',
  'bad': 'F',
  'happy': 'T',
  'sad': 'F',
  'not good': 'F',
  'not bad': 'T',
  'not happy': 'F',
  'not sad': 'T',
  'very good': 'T',
  'very bad': 'F',
  'very happy': 'T',
  'very sad': 'F',
  'i am happy': 'T',
  'this is good': 'T',
  'i am bad': 'F',
  'this is bad': 'F',
  'i am sad': 'F',
  'this is sad': 'F',
  'i am not happy': 'F',
  'this is not good': 'F',
  'i am not bad': 'T',
  'this is not sad': 'T',
  'i am very happy': 'T',
  'this is very good': 'T',
  'i am very bad': 'F',
  'this is very sad': 'F',
  'this is very happy': 'T',
  'i am good not bad': 'T',
  'this is good not bad': 'T',
  'i am bad not good': 'F',
  'i am good and happy': 'T',
  'this is not good and not happy': 'F',
  'i am not at all good': 'F',
  'i am not at all bad': 'T',
  'i am not at all happy': 'F',
  'this is not at all sad': 'T',
  'this is not at all happy': 'F',
  'i am good right now': 'T',
  'i am bad right now': 'F',
  'this is bad right now': 'F',
  'i am sad right now': 'F',
  'i was good earlier': 'T',
  'i was happy earlier': 'T',
  'i was bad earlier': 'F',
  'i was sad earlier': 'F',
  'i am very bad right now': 'F',
  'this is very good right now': 'T',
  'this is very sad right now': 'F',
  'this was bad earlier': 'F',
  'this was very good earlier': 'T',
  'this was very bad earlier': 'F',
  'this was very happy earlier': 'T',
  'this was very sad earlier': 'F',
  'i was good and not bad earlier': 'T',
  'i was not good and not happy earlier': 'F',
  'i am not at all bad or sad right now': 'T',
  'i am not at all good or happy right now': 'F',
  'this was not happy and not good earlier': 'F',
}

test_data = {
  'this is happy': 'T',
  'i am good': 'T',
  'this is not happy': 'F',
  'i am not good': 'F',
  'this is not bad': 'T',
  'i am not sad': 'T',
  'i am very good': 'T',
  'this is very bad': 'F',
  'i am very sad': 'F',
  'this is bad not good': 'F',
  'this is good and happy': 'T',
  'i am not good and not happy': 'F',
  'i am not at all sad': 'T',
  'this is not at all good': 'F',
  'this is not at all bad': 'T',
  'this is good right now': 'T',
  'this is sad right now': 'F',
  'this is very bad right now': 'F',
  'this was good earlier': 'T',
  'i was not happy and not good earlier': 'F',
}

# -*- coding: utf-8 -*-
# @Time    : 2020-9-22 13:51
# @Author  : Jue
import numpy as np

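def text2vector(text, vocab_size, T_x):
	"""
	Return the one-hot representation of a sentence, shape (n_x, T_x)
	- text : the sentence
	- vocab_size : vocabulary size
	- T_x : maximum sentence length
	"""
	inputs = np.zeros((vocab_size, T_x))
	t = 0
	for w in text.split(' '):
		v = np.zeros(vocab_size)
		v[word_to_idx[w]] = 1
		inputs[:, t] = v
		t += 1
		# stop once the maximum length is reached; shorter sentences stay zero-padded
		if t == T_x:
			break

	return inputs

# Return the length of the longest sentence
def getTx(train_data):
	return max(len(text.split(' ')) for text in train_data)

def getData(data, vocab_size, label_dic):
	"""
	Returns X, the inputs for all time steps with shape (n_x, m, T_x),
	and Y, the one-hot labels with shape (2, m)
	"""
	# items are (text, label) pairs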
	t_x = getTx(data.keys())
	items = list(data.items())
	m = len(items)
	X = np.zeros((vocab_size, t_x, m))  # n_x,t_x,m
	Y = np.zeros((2, m))

	i = 0
	for x, y in items:
		inputs = text2vector(x, vocab_size, t_x)
		X[..., i] = inputs
		Y[label_dic[y], i] = 1
		i += 1

	X = X.transpose(0, 2, 1)

	return X, Y

if __name__ == '__main__':
	labels = ['T', 'F']
	d = {l: i for i, l in enumerate(labels)}

	vocab = list(set([w for text in train_data.keys() for w in text.split(' ')]))
	vocab_size = len(vocab)
	print('%d unique words found' % vocab_size)

	# Assign indices to each word.
	word_to_idx = {w: i for i, w in enumerate(vocab)}
	idx_to_word = {i: w for i, w in enumerate(vocab)}

	X_train, Y_train = getData(train_data, vocab_size, d)
	X_test, Y_test = getData(test_data, vocab_size, d)



rnn = RNN(vocab_size, 2)
rnn.fit(X_train, Y_train, mini_batch_size=8, epochs=2000, print_cost=True,X_test=X_test,Y_test=Y_test)



  1. Andrew Ng's Deep Learning course
  2. An Introduction to Recurrent Neural Networks for Beginners
  3. Andrew Ng's Deep Learning: Recurrent Neural Networks
  4. Derivatives of Softmax and Cross-entropy
