Anna Karenina Text Generation (LSTM Model)

Project description: We train an LSTM model on the English text of the novel Anna Karenina as the training corpus. The input is a single character; by learning from the character sequence of the whole book, the model generates new text, predicting one new character at a time.

Libraries used: time, numpy, tensorflow
Functions used:
tf.contrib.rnn.DropoutWrapper            # adds dropout regularization to the hidden layer to prevent overfitting
tf.contrib.rnn.MultiRNNCell              # stacks basic LSTM cells on top of each other
tf.nn.softmax                            # the softmax layer returns a probability distribution
tf.nn.softmax_cross_entropy_with_logits  # computes the loss as softmax cross-entropy
tf.clip_by_global_norm                   # returns the clipped gradients and the global norm
tf.train.AdamOptimizer                   # the optimizer
tf.nn.dynamic_rnn                        # runs the RNN over a sequence

Main content: the following four parts

  • Data preprocessing: loading the data, encoding it, and splitting it into mini-batches

  • Model construction: input layer, LSTM layer, output layer, loss, and optimizer

  • Model training: setting the hyperparameters and training the model

  • Text generation: generating new text with the trained model

I. Data Preprocessing

This part covers encoding the data and splitting it into mini-batches.

First we load the data and convert it to an integer encoding. Since the model is built at the character level (single characters such as letters and punctuation marks, collectively called characters below), both the input and the output are characters. For example, given the word "hello", an LSTM built on it should predict "e" when fed "h", predict "l" when fed "e", and so on. Our inputs are therefore individual characters, and we convert the text accordingly below.

import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

Convert every character in the text to an integer and store the result:

with open('anna.txt', 'r') as f:
    text=f.read()
vocab = set(text)
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
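
It is worth a quick look at the result of the encoding. A small sketch (not part of the original pipeline) that prints the first characters of the book, the integers they map to, and the vocabulary size:

print(text[:100])                      # the raw text
print(encoded[:100])                   # the same characters as integers
print('vocabulary size:', len(vocab))  # number of distinct characters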

Splitting into mini-batches

Let N (n_seqs) be the number of sequences in a batch and M (num_steps) the length of a single sequence. Each batch is then an N x M array, i.e. it contains N*M characters. In the figure below, with N=2 and M=3, each batch holds 2 x 3 = 6 characters, so a sequence of 12 characters can be split into 12 / 6 = 2 batches.

[Figure: splitting the character sequence into mini-batches. The label in the middle of the original figure should read n_seqs (the number of sequences per batch), not batch size.]

def get_batches(arr, n_seqs, n_steps):
    '''
    Split an existing array into mini-batches

    arr: array to split
    n_seqs: number of sequences per batch
    n_steps: number of characters per sequence
    '''

    batch_size = n_seqs * n_steps
    n_batches = int(len(arr) / batch_size)
    # Keep only complete batches; drop the remainder that does not divide evenly
    arr = arr[:batch_size * n_batches]

    # Reshape
    arr = arr.reshape((n_seqs, -1))

    for n in range(0, arr.shape[1], n_steps):
        # inputs
        x = arr[:, n:n+n_steps]
        # targets: x shifted left by one character, with the last target wrapping around
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

The code above defines a generator: calling the function returns a generator object from which we can fetch one batch at a time (a quick check follows below). With the steps above, the preprocessing of the dataset is complete; the next step is to build the model.
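
As a small sketch (not from the original code), pulling one batch shows that y is simply x shifted one character to the left:

batches = get_batches(encoded, 10, 50)
x, y = next(batches)
print('x\n', x[:5, :5])
print('\ny\n', y[:5, :5])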

II. Model Construction

Model construction covers the input layer, the LSTM layer, the output layer, the loss, and the optimizer; we implement them one block at a time.

1. Input layer

In the preprocessing stage we defined the mini-batch splitting function; the size of the input layer depends on the batch shape we choose (n_seqs x n_steps). We build the input layer first.

def build_inputs(num_seqs, num_steps):
    '''
    Build the input layer

    num_seqs: number of sequences per batch
    num_steps: number of characters per sequence
    '''
    inputs = tf.placeholder(tf.int32, shape=(num_seqs, num_steps), name='inputs')
    targets = tf.placeholder(tf.int32, shape=(num_seqs, num_steps), name='targets')

    # Placeholder for the dropout keep probability
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    return inputs, targets, keep_prob

2. LSTM layer

The LSTM layer is the key part of the network. In TensorFlow, the tf.contrib.rnn module provides two classes, BasicLSTMCell and LSTMCell, which differ as follows:

BasicLSTMCell does not allow cell clipping, a projection layer, and does not use peep-hole connections: it is the basic baseline. (from the TensorFlow documentation)

Here we only use the basic module, BasicLSTMCell.

def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    '''
    Build the LSTM layer

    keep_prob: dropout keep probability
    lstm_size: number of units in the LSTM hidden layer
    num_layers: number of LSTM layers
    batch_size: batch size

    '''
    def build_cell(lstm_size, keep_prob):
        # Build a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Add dropout
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return drop

    # Stack the cells: MultiRNNCell takes a list of cell objects, and each layer
    # gets its own cell (reusing a single cell object across layers causes
    # variable-sharing errors in newer TF 1.x releases)
    cell = tf.contrib.rnn.MultiRNNCell([build_cell(lstm_size, keep_prob) for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)

    return cell, initial_state

After building the basic LSTM cell, we wrap it in dropout to prevent overfitting. MultiRNNCell then stacks the LSTM cells; it takes a list of cell objects, one per layer. Finally, initial_state defines the initial cell state.

3. Output layer

With the input and LSTM layers in place, we now build the output layer, which is a softmax layer fully connected to the LSTM. Each character, after passing through the LSTM, produces an output of size 1 x L (where L is the number of units in the LSTM hidden layer). As analysed above, feeding an N x M batch gives an LSTM output of shape N x M x L. To connect this output to the fully connected softmax layer, we reshape it into a 2D tensor of shape (N*M) x L. The number of units in the softmax layer equals the vocabulary size (we want a probability distribution over characters), so the weight matrix between the LSTM layer and the softmax layer has shape L x vocab_size.

def build_output(lstm_output, in_size, out_size):
    '''
    Build the output layer

    lstm_output: output of the LSTM layer
    in_size: size of the reshaped LSTM output
    out_size: size of the softmax layer

    '''

    # Concatenate the LSTM output along the columns,
    # e.g. [[1,2,3],[7,8,9]] becomes [1,2,3,7,8,9]
    seq_output = tf.concat(lstm_output, axis=1) # tf.concat(values, axis)
    # reshape into a 2D tensor of shape (N*M, in_size)
    x = tf.reshape(seq_output, [-1, in_size])

    # Fully connect the LSTM layer to the softmax layer
    with tf.variable_scope('softmax'):
        softmax_w = tf.Variable(tf.truncated_normal([in_size, out_size], stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))

    # Compute the logits
    logits = tf.matmul(x, softmax_w) + softmax_b

    # The softmax layer returns a probability distribution
    out = tf.nn.softmax(logits, name='predictions')

    return out, logits

After reshaping the data, we connect the LSTM layer to the softmax layer and compute the logits and the softmax probability distribution.
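
As a quick check of these shapes, here is a plain-NumPy sketch with assumed example sizes (N=100, M=100, L=512, a vocabulary of 83 characters); it is not part of the model code:

import numpy as np

N, M, L, vocab_size = 100, 100, 512, 83               # assumed example sizes
lstm_output = np.zeros((N, M, L), dtype=np.float32)   # shape of the dynamic_rnn output
x = lstm_output.reshape(-1, L)                        # (N*M, L)  -> (10000, 512)
softmax_w = np.zeros((L, vocab_size), dtype=np.float32)
logits = x.dot(softmax_w)                             # (N*M, vocab_size) -> (10000, 83)
print(x.shape, logits.shape)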

4. Computing the training loss

The network is now fully built; next we define the training loss and the optimizer. Since the softmax layer outputs a probability distribution, we one-hot encode the targets and compute the loss with softmax_cross_entropy_with_logits.

def build_loss(logits, targets, lstm_size, num_classes):
    '''
    Compute the loss from the logits and targets

    logits: output of the fully connected layer (before softmax)
    targets: targets
    lstm_size
    num_classes: vocab_size

    '''

    # One-hot encode the targets
    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped = tf.reshape(y_one_hot, logits.get_shape())

    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)

    return loss

5. Optimizer

RNNs suffer from exploding and vanishing gradients. LSTMs solve the vanishing-gradient problem, but gradients can still explode, so we apply gradient clipping: we set a threshold, and whenever the gradients exceed it they are rescaled back to that threshold, which keeps them from growing too large.

def build_optimizer(loss, learning_rate, grad_clip):
    '''
    Build the optimizer

    loss: loss
    learning_rate: learning rate
    grad_clip: gradient clipping threshold

    '''

    # Clip the gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))

    return optimizer

tf.clip_by_global_norm returns the clipped gradients together with the global norm. The whole learning process uses AdamOptimizer.
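
As a small illustration of what tf.clip_by_global_norm does (a toy sketch, separate from the training graph; the tensors here are made-up constants):

# Toy example: the 'gradients' [3.] and [4.] have global norm sqrt(3^2 + 4^2) = 5,
# so with clip_norm=2.5 every gradient is rescaled by 2.5 / 5 = 0.5
toy_grads = [tf.constant([3.0]), tf.constant([4.0])]
clipped, global_norm = tf.clip_by_global_norm(toy_grads, clip_norm=2.5)

with tf.Session() as sess:
    clipped_vals, norm_val = sess.run([clipped, global_norm])
    print(norm_val)      # 5.0 -- the norm before clipping
    print(clipped_vals)  # [array([1.5], ...), array([2.0], ...)]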

6. Putting the model together

With the five pieces above in place, we now combine them into a single class.

class CharRNN:

    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):

        # When sampling (generating text) we feed one character at a time,
        # so batch_size and num_steps are both set to 1
        if sampling == True:
            batch_size, num_steps = 1, 1
        else:
            batch_size, num_steps = batch_size, num_steps

        tf.reset_default_graph()

        # Input layer
        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)

        # LSTM layer
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        # One-hot encode the inputs
        x_one_hot = tf.one_hot(self.inputs, num_classes)

        # Run the RNN
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=self.initial_state)
        self.final_state = state

        # Predictions
        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)

        # Loss and optimizer (with gradient clipping)
        self.loss = build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

We use tf.nn.dynamic_rnn to run the RNN over the input sequence.

III. Model Training

Parameter settings
Before training the model, we first initialize some hyperparameters. The main ones are:

  • num_seqs: number of sequences per batch (batch_size in the code below)
  • num_steps: number of characters per sequence
  • lstm_size: number of units in the hidden layer
  • num_layers: number of LSTM layers
  • learning_rate: learning rate
  • keep_prob: proportion of units kept by the dropout layer

batch_size = 100         # Sequences per batch
num_steps = 100          # Number of sequence steps per batch
lstm_size = 512         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.001    # Learning rate
keep_prob = 0.5         # Dropout keep probability

These are the values I chose; for tuning advice, see the suggestions in Andrej Karpathy's GitHub repository. With the parameters set, we are just one step away from running the whole LSTM, so let's train the model.

epochs = 20
# Save the variables every save_every_n training steps
save_every_n = 200

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, 
                                                 model.final_state, 
                                                 model.optimizer], 
                                                 feed_dict=feed)

            end = time.time()
            # control the print lines
            if counter % 100 == 0:
                print('Epoch: {}/{}... '.format(e+1, epochs),
                      'Training step: {}... '.format(counter),
                      'Training loss: {:.4f}... '.format(batch_loss),
                      '{:.4f} sec/batch'.format((end-start)))

            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

I set the number of epochs to 20, and the code saves the variables every 200 training steps. The benefit is that we can later look back and observe, step by step, how the generated text "evolves" over the course of training.

IV. Text Generation

Now we can generate text from the trained parameters. When we feed in a character, the LSTM predicts the next character; we then feed that new character back in, and by repeating this loop we can keep generating text.

To reduce noise, at each step I restrict the choice to the five most likely predictions and pick among them at random. For example, if the input is "h" and the five most probable next characters are [o, e, i, u, b], we randomly pick one of these five as the new character. Adding this element of randomness to the process helps reduce noisy output.

def pick_top_n(preds, vocab_size, top_n=5):
    """
    Pick one of the top_n most likely characters from the predictions

    preds: predictions
    vocab_size
    top_n
    """
    p = np.squeeze(preds)
    # Zero out everything except the top_n most likely predictions
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize the probabilities
    p = p / np.sum(p)
    # Randomly pick one character
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c

def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    """
    Generate new text

    checkpoint: parameter file saved at a given training step
    n_samples: length of the new text in characters
    lstm_size: number of hidden units
    vocab_size
    prime: starting text
    """
    # Split the prime text into a list of single characters
    samples = [c for c in prime]
    # sampling=True means the batch is 1 x 1, i.e. one character at a time
    model = CharRNN(len(vocab), lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Restore the saved model parameters
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        for c in prime:
            x = np.zeros((1, 1))
            # Feed in a single character
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

        c = pick_top_n(preds, len(vocab))
        # Append the character to samples
        samples.append(int_to_vocab[c])

        # Keep generating characters until we reach the requested length
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

            c = pick_top_n(preds, len(vocab))
            samples.append(int_to_vocab[c])

    return ''.join(samples)

# Generate text with the final training parameters
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="The")
print(samp)

Sample output at training step 3960:

Ther
stepping in the service, to the sound of the word of his hands.

“You suppose, I has not the ball of me, I’ll go to the sort of more the
meaning of the carriage, and then that this can’t be done to be seet that
it had all to be anyone when he carefully we that she had seen it; I cannot
give mind at their conversation. And then that I will come to me a clear
onesely to see them, and so, then all the way of hope and they’re not too
too.”

“No, I don’t say what a marriage, I should have to be a conversation. They
were an elect tree.”

“Why, have thought you and I said nothing?…”

“Yes, yes. And they’ve always asked him. It’s all alone,” asked the old
passionate tone.

And she was completely for the peasants.

The meanon of hissers, as though he worred to begin him,” answered the
look and preparations, but a crange of his still seeing her, she went on
to the carriage.

They thought of his brother at his blood that saw her sister had been
at once an idoor that their harress and any answer, though he had
standing it, because he was the profissor of the world and through
her terrible one along the carriages.

“Yes, but I could shall have been at his hass. Here, to be so than that
that’s something at all.”

“Whare an offer to see it. You should have been in the same. But it’s
not much to her; I stayed it with you…. Yes, that is a concollent
dogror in,” was a planfor and word, still he was not all that there was
staying, to help angining to stay to a letter, than the same something
was ashemed to be seen. He could not string her so all the carriages of
the searow, and whe with his baby still mother. And the conversation,
her face, with the state words of her soul satideat.

Sergey Ivanovitch had, and the marriage they struggland at the time
without her support, and then he was conscious of his first-raid.

After the crowd and went into the room to the party and too, he had
asked for the certainty to believe it, would have said that he couse has
a subjict, had been so so
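
To see how the generated text "evolves" over the course of training, we can also sample from one of the intermediate checkpoints that were saved every 200 steps. A sketch, assuming the checkpoint from step 200 (saved as checkpoints/i200_l512.ckpt under the naming pattern used above) is still on disk:

# Sample from an early checkpoint to compare with the output above
checkpoint = 'checkpoints/i200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="The")
print(samp)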

V. Summary

This was my first deep learning project. A journey of a thousand miles begins with a single step. Keep going!
