Implementing an RNN with Python, Numpy and Theano

The original version of this post is here: Recurrent Neural Networks Tutorial, Part 2 – Implementing a RNN with Python, Numpy and Theano. It is the second part of a four-part series. The original is very good: detailed and accompanied by code. To make it easier to review later, I roughly translated it (not word for word). If there are any copyright issues, please leave me a comment.


Data acquisition and preprocessing

Data source:

reddit comments from a dataset available on Google’s BigQuery

Data preprocessing

Tokenization:

Our goal is to generate new words from the words seen so far. To do that, we first split the raw text into sentences and then split each sentence into individual words. NLTK's word_tokenize and sent_tokenize functions take care of this for us.
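A minimal sketch of what these two NLTK helpers do (the example sentence is illustrative, and the punkt tokenizer data must be downloaded once beforehand):

import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer model

text = "Knowledge is power. It always has been."
sentences = nltk.sent_tokenize(text.lower())
print sentences
# ['knowledge is power.', 'it always has been.']
print nltk.word_tokenize(sentences[0])
# ['knowledge', 'is', 'power', '.']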

Controlling the vocabulary size:
1. Remove infrequent words

When we learn a new word ourselves, we cannot learn it without seeing it used enough times. The same is true for the model: if a word appears only once or twice in the text data, the model cannot learn how to use it; keeping it only makes training harder and slower, so it is best removed. We sort the words by frequency in descending order and keep only the top vocabulary_size words.

2. Replace removed words with UNKNOWN_TOKEN

Words that appear in the training data but were removed (because they occur too rarely) are all replaced by a single special token, UNKNOWN_TOKEN. For example, if the word nonlinearities is one of the removed words, the sentence "nonlinearities are important in neural networks" becomes "UNKNOWN_TOKEN are important in neural networks". UNKNOWN_TOKEN becomes part of our vocabulary and is predicted just like any other word. When the model generates UNKNOWN_TOKEN, we can either replace it in the output with a randomly chosen word that is not in the vocabulary, or keep resampling until we get a sentence that does not contain it.

3. Add start and end tokens

To be able to predict the first and last word of a sentence, we introduce two special tokens: SENTENCE_START marks the beginning and SENTENCE_END marks the end, and we prepend/append them to every sentence in the training set. When the input is SENTENCE_START, the corresponding label should be the actual first word of the sentence.

4. Build the training data matrices

A neural network takes vectors, not strings, as input, so we need mappings between words and integers: index_to_word and word_to_index. For example, the word "friendly" might get index 2001. A training input might then be x = [0, 179, 341, 416], where 0 stands for SENTENCE_START, with label y = [179, 341, 416, 1]. Since our goal is to predict the next word, y is simply x shifted left by one position, with SENTENCE_END appended at the end. In other words, the correct prediction for word 179 is 341, the actual next word.

Code for data acquisition and preprocessing:

import csv
import itertools
import numpy as np
import nltk

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print "Reading CSV file..."
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print "Found %d unique words tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print "\nExample sentence: '%s'" % sentences[0]
print "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0]

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])


Here’s an actual training example from our text:

x:
SENTENCE_START what are n't you understanding about this ? !
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]

y:
what are n't you understanding about this ? ! SENTENCE_END
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

Building the RNN model

(Figure: the RNN unrolled through time, showing the inputs x_t, hidden states s_t, outputs o_t, and the shared parameters U, V, W.)
It is important to understand what the symbols in the figure mean:
$x$ denotes one complete input, i.e. one sample; here you can think of it as one sentence, for example "knowledge is power".
$x_t$ denotes a single input at time step $t$, e.g. one individual word of that sentence.
$s_t$ denotes the hidden state of the RNN at step $t$.
$o$ denotes the prediction; with a vocabulary of 8000 words, $o$ is a length-8000 vector representing a probability distribution, i.e. what the network outputs after the softmax.
$U$, $V$, $W$ are the parameters to be trained.

The input x to the RNN is a sentence, i.e. a sequence of words, and each $x_t$ is a single word. However, because of how matrix multiplication works, we cannot simply use a word index (such as 36) as the input. Instead, each word is represented as a one-hot vector of length vocabulary_size: for example, the word with index 5 becomes [0, 0, 0, 0, 1, 0, …]. Each word $x_t$ is therefore a vector, and the full input $x$ is a matrix with one row per word. We perform this conversion inside the RNN code rather than in preprocessing. The network output uses the same format: each $o_t$ is a vector of length vocabulary_size, and each element gives the probability of the corresponding word being the next word in the sentence.
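A small numpy sketch of this point (the sizes here are just illustrative): multiplying U by a one-hot vector simply selects one column of U, which is also why the forward pass below can index U by the word index instead of building one-hot vectors explicitly.

import numpy as np

C, H = 8000, 100                      # vocabulary size and hidden layer size
U = np.random.uniform(-0.01, 0.01, (H, C))

word_index = 5
one_hot = np.zeros(C)
one_hot[word_index] = 1.0

# Multiplying by the one-hot vector picks out column word_index of U
print np.allclose(U.dot(one_hot), U[:, word_index])   # True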
If we write $C$ for vocabulary_size, with $C = 8000$, and let $H = 100$ be the size of the hidden layer, then the network parameters $U$, $V$, $W$ contain a total of $2HC + H^2 = 1{,}610{,}000$ values. $U$, $V$, $W$ must not be initialized to zero; a recommended range is $\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where $n$ is the number of incoming connections from the previous layer.
The dimensions of the quantities involved:

$x_t \in \mathbb{R}^{8000}$
$o_t \in \mathbb{R}^{8000}$
$s_t \in \mathbb{R}^{100}$
$U \in \mathbb{R}^{100 \times 8000}$
$V \in \mathbb{R}^{8000 \times 100}$
$W \in \mathbb{R}^{100 \times 100}$
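Given these shapes, the parameter count quoted above follows directly:

$$|U| + |V| + |W| = HC + CH + H^2 = 2HC + H^2 = 2 \cdot 100 \cdot 8000 + 100^2 = 1{,}610{,}000$$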

Initialization

class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
word_dim is the size of our vocabulary and hidden_dim is the size of the hidden layer. Don't worry about the bptt_truncate parameter for now; we'll explain it later.

Forward Propagation

def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]

RNNNumpy.forward_propagation = forward_propagation
# s[t-1] is the previous hidden state
# Both the outputs and the hidden states are saved: the hidden states are needed later to compute the gradients, and each o_t is a vector representing a probability distribution.
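The forward pass above calls a softmax helper that is not defined in this snippet; in the original tutorial it comes from a small utils module. A minimal, numerically stable sketch:

import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / np.sum(e)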

When evaluating the model, often all we want is the single most probable next word. We call this function predict: the forward pass produces, at each position, a softmax distribution of length 8000, and predict simply returns the index of the largest entry in that vector.

Prediction

The predict function:

def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNNNumpy.predict = predict

Testing with an example:

np.random.seed(10) # Set a random seed whenever randomness is involved; U, V and W are initialized randomly here
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print o.shape
print o
# (45, 8000) # 45 is the number of words in X_train[10], i.e. the sentence has 45 input positions.
# [[ 0.00012408  0.0001244   0.00012603 ...,  0.00012515  0.00012488
#    0.00012508]
#  [ 0.00012536  0.00012582  0.00012436 ...,  0.00012482  0.00012456
#    0.00012451]
#  [ 0.00012387  0.0001252   0.00012474 ...,  0.00012559  0.00012588
#    0.00012551]
#  ...,
#  [ 0.00012414  0.00012455  0.0001252  ...,  0.00012487  0.00012494
#    0.0001263 ]
#  [ 0.0001252   0.00012393  0.00012509 ...,  0.00012407  0.00012578
#    0.00012502]
#  [ 0.00012472  0.0001253   0.00012487 ...,  0.00012463  0.00012536
#    0.00012665]]

The parameters $U$, $V$, $W$ used above are random, so the results are random too. The predict function below gives the predicted word index at each position.

predictions = model.predict(X_train[10])  # one predicted word index for each of the 45 input positions
print predictions.shape
print predictions
# (45,)
# [1284 5221 7653 7430 1013 3562 7366 4860 2212 6601 7299 4556 2481 238 2539
#  21 6548 261 1780 2005 1810 5376 4146 477 7051 4832 4991 897 3485 21
#  7291 2007 6006 760 4864 2182 6569 2800 2752 6821 4437 7021 7875 6912 3575]
Loss function

The loss is the cross-entropy loss:
$$L(y, o) = -\frac{1}{N} \sum_{n \in N} y_n \log o_n$$
The further apart y (the correct words) and o (our predictions) are, the greater the loss.
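As a quick numeric illustration: if the model assigns the correct word a probability of 0.1 at one position and 0.001 at another, the per-word contributions to the loss are

$$-\log(0.1) \approx 2.30, \qquad -\log(0.001) \approx 6.91,$$

so positions where the model is less confident about the correct word contribute more to the loss.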

calculate_loss

def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum([len(y_i) for y_i in y])
    return self.calculate_total_loss(x,y)/N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss

Let's see how the model does with random predictions and compare against a baseline to check that the forward pass is correct. The baseline: with C words in the vocabulary and N words in the training data, if every word were predicted with equal probability 1/C, the loss would be
$$L = -\frac{1}{N} \cdot N \log \frac{1}{C} = \log C$$

# Here one example is one sentence, such as the 45-word sentence above.
# Limit to 1000 examples to save time
print "Expected Loss for random predictions: %f" % np.log(vocabulary_size)
print "Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000])
# Expected Loss for random predictions: 8.987197
# Actual loss: 8.987440

Using these 1,000 examples, the loss from the forward pass is very close to the baseline, which suggests that our forward propagation is implemented correctly.

Training with SGD and BPTT

To train the parameters to minimize the loss, the most commonly used method is SGD (stochastic gradient descent): we iterate over the training examples and, for each one, nudge the parameters in the direction that reduces the loss. That direction is given by the gradients $\frac{\partial L}{\partial U}, \frac{\partial L}{\partial V}, \frac{\partial L}{\partial W}$, and the SGD learning rate determines how big a step we take in each iteration. SGD is the most important optimization method for neural networks as well as many traditional machine learning algorithms, so a lot of research has gone into batching, parallelization, and adaptive learning rates.
The gradients themselves are computed with backpropagation; for RNNs we use a modified version called BPTT (Backpropagation Through Time). Because the parameters are shared across time steps, the output at each step depends not only on the current step but on all previous ones, so the chain rule has to be applied through time. We won't go into the details here; all we need to know is that, given a training example (x, y), BPTT returns the gradients $\frac{\partial L}{\partial U}, \frac{\partial L}{\partial V}, \frac{\partial L}{\partial W}$.
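As a rough sketch of what the bptt code below computes (writing $E_t$ for the loss at step $t$ and $\delta^o_t = o_t - y_t$ for the output error, with $y_t$ one-hot):

$$\frac{\partial E_t}{\partial V} = \delta^o_t \otimes s_t, \qquad \delta_t = V^\top \delta^o_t \odot (1 - s_t^2)$$

and then, stepping backwards through time for $k = t, t-1, \dots$ (for at most bptt_truncate steps):

$$\frac{\partial E_t}{\partial W} \mathrel{+}= \delta_k \otimes s_{k-1}, \qquad \frac{\partial E_t}{\partial U_{:,x_k}} \mathrel{+}= \delta_k, \qquad \delta_{k-1} = W^\top \delta_k \odot (1 - s_{k-1}^2)$$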
Backpropagation

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNNumpy.bptt = bptt
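A note on the bptt_truncate parameter promised earlier: the inner loop above only propagates the error back at most self.bptt_truncate steps from each time step t. This truncated BPTT caps the cost of the backward pass and limits how far gradients flow back through very long sequences.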
Gradient checking

Not only the forward pass needs checking; the gradients should be checked as well to make sure they are correct. The idea behind gradient checking is that the derivative of the loss with respect to a parameter equals the slope at that point, which we can approximate by perturbing the parameter slightly and dividing by the change:
$$\frac{\partial L}{\partial \theta} \approx \frac{J(\theta + h) - J(\theta - h)}{2h}$$
where $J$ is the loss as a function of the parameters, $h$ is a small value, and $\theta$ is the parameter being checked.
We compare the gradients computed by our code with the gradients estimated from this formula; if the difference is small, our implementation is probably correct. Because this check is computationally expensive, the code below uses a much smaller vocabulary.

import operator

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # calculate The relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large fail the gradient check
            if relative_error > error_threshold:
                print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
                print "+h Loss: %f" % gradplus
                print "-h Loss: %f" % gradminus
                print "Estimated_gradient: %f" % estimated_gradient
                print "Backpropagation gradient: %f" % backprop_gradient
                print "Relative Error: %f" % relative_error
                return
            it.iternext()
        print "Gradient check for parameter %s passed." % (pname)

RNNNumpy.gradient_check = gradient_check

# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])
SGD implementation

Two steps:
  1. An sgd_step function that computes the gradients and performs the updates for one batch (here, one training example).
  2. An outer loop that iterates over the training set and adjusts the learning rate.
# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = numpy_sdg_step
import sys
from datetime import datetime

# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print "Setting learning rate to %f" % learning_rate
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1

#  Done! Let’s try to get a sense of how long it would take to train our network:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

One full pass over the entire training set is called an epoch.

np.random.seed(10)
# Train on a small subset of the data to see what happens
model = RNNNumpy(vocabulary_size)
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=10, evaluate_loss_after=1)
# 2015-09-30 10:08:19: Loss after num_examples_seen=0 epoch=0: 8.987425
# 2015-09-30 10:08:35: Loss after num_examples_seen=100 epoch=1: 8.976270
# 2015-09-30 10:08:50: Loss after num_examples_seen=200 epoch=2: 8.960212
# 2015-09-30 10:09:06: Loss after num_examples_seen=300 epoch=3: 8.930430
# 2015-09-30 10:09:22: Loss after num_examples_seen=400 epoch=4: 8.862264
# 2015-09-30 10:09:38: Loss after num_examples_seen=500 epoch=5: 6.913570
# 2015-09-30 10:09:53: Loss after num_examples_seen=600 epoch=6: 6.302493
# 2015-09-30 10:10:07: Loss after num_examples_seen=700 epoch=7: 6.014995
# 2015-09-30 10:10:24: Loss after num_examples_seen=800 epoch=8: 5.833877
# 2015-09-30 10:10:39: Loss after num_examples_seen=900 epoch=9: 5.710718

Sure enough, the training loss does decrease as training progresses.
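Since train_with_sgd records (num_examples_seen, loss) pairs in losses, a quick plot (assuming matplotlib is available) makes the trend easy to see:

import matplotlib.pyplot as plt

# losses is the list of (num_examples_seen, loss) tuples returned by train_with_sgd
steps, values = zip(*losses)
plt.plot(steps, values, marker='o')
plt.xlabel("Examples seen")
plt.ylabel("Loss")
plt.show()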

Training the network with Theano and a GPU

Since I don't have a GPU, I won't translate the remaining bit and will just paste a little code here. See:
Theano and GPU code
Speeding up your Neural Network with Theano and the GPU

np.random.seed(10)
# RNNTheano is the Theano implementation from the tutorial's companion code (rnn_theano.py)
model = RNNTheano(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)
from utils import load_model_parameters_theano, save_model_parameters_theano
 
model = RNNTheano(vocabulary_size, hidden_dim=50)
# losses = train_with_sgd(model, X_train, y_train, nepoch=50)
# save_model_parameters_theano('./data/trained-model-theano.npz', model)
load_model_parameters_theano('./data/trained-model-theano.npz', model)
Generating Text
Now that we have our model we can ask it to generate new text for us! Let’s implement a helper function to generate new sentences:

def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str
 
num_sentences = 10
senten_min_length = 7
 
for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print " ".join(sent)