TensorFlow RNN Example: Predicting a Sentence's Continuation with a Recurrent-Neural-Network Language Model

This post implements a neural language model in TensorFlow. Natural-language modeling can be abstracted as an embedding layer + recurrent-neural-network layer + softmax layer. Compared with a plain recurrent network, the NLP application adds two layers: the word-embedding (embedding) layer and the softmax layer.
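At a shape level, the data flow through these three layers can be traced with the short NumPy sketch below. This is my own illustration, not part of the model code: it reuses the hyper-parameter values of the script that follows, and the random arrays only stand in for the real embeddings and LSTM outputs.

import numpy as np

# Illustrative shape trace only (random numbers, no training).
batch_size, num_steps, hidden_size, vocab_size = 20, 35, 300, 10000

input_data = np.random.randint(0, vocab_size, size=(batch_size, num_steps))  # word IDs
embedding = np.random.randn(vocab_size, hidden_size)
inputs = embedding[input_data]                        # [batch, steps, hidden]: embedding layer
rnn_outputs = np.random.randn(batch_size, num_steps, hidden_size)  # stands in for the LSTM outputs
output = rnn_outputs.reshape(-1, hidden_size)         # [batch * steps, hidden]
logits = output @ embedding.T                         # [batch * steps, vocab]: softmax layer sharing the embedding weights
print(inputs.shape, output.shape, logits.shape)       # (20, 35, 300) (700, 300) (700, 10000)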

The theory is not written up here for now; it may be added later. This post only records the code and the results.

Training uses the PTB dataset. The raw dataset has to be preprocessed into word-ID files first; that step is also not written up here for now and may be added later (a rough sketch of the idea is given below). If you need the raw or the preprocessed dataset, leave a comment.
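For reference only, the preprocessing might look roughly like the sketch below. This is my own assumption, not the author's actual script: the raw-text and vocabulary file names are hypothetical, and the only hard requirement is that the output files contain space-separated word IDs, which is what read_data() in the training script expects.

import collections

RAW_TRAIN = "./ptb_data/ptb.train.txt"   # hypothetical path to the raw PTB training text
VOCAB_OUT = "./ptb_data/ptb.vocab"       # hypothetical vocabulary file
ID_OUT = "./ptb_data/ptb.train"          # word-ID file consumed by read_data() in the script below

# 1. Count word frequencies and keep the 10000 most frequent words as the vocabulary.
counter = collections.Counter()
with open(RAW_TRAIN, "r") as fin:
    for line in fin:
        counter.update(line.strip().split() + ["<eos>"])
vocab = [word for word, _ in counter.most_common(10000)]
word_to_id = {word: i for i, word in enumerate(vocab)}
with open(VOCAB_OUT, "w") as fout:
    fout.write("\n".join(vocab))

# 2. Replace every word with its ID, writing one line of space-separated IDs per sentence.
with open(RAW_TRAIN, "r") as fin, open(ID_OUT, "w") as fout:
    for line in fin:
        words = line.strip().split() + ["<eos>"]
        ids = [str(word_to_id.get(w, word_to_id.get("<unk>", 0))) for w in words]
        fout.write(" ".join(ids) + "\n")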

# coding: utf-8
"""
@ File:     ptb_rnn_train.py
@ Brief:    语句预测,基于 PTB 数据集与 RNN 深层循环神经网络
@ Envs:     Python 3.6 + TensorFlow 1.14, win10 + pycharm
@ Author:   攀援的井蛙
@ Date:     2020-09-15
"""

import numpy as np
import tensorflow as tf


# Hyper-parameters
TRAIN_DATA = "./ptb_data/ptb.train"   # path to the training data
EVAL_DATA = "./ptb_data/ptb.valid"    # path to the validation data
TEST_DATA = "./ptb_data/ptb.test"     # path to the test data

HIDDEN_SIZE = 300               # hidden-layer size
NUM_LAYERS = 2                  # number of LSTM layers in the deep RNN
VOCAB_SIZE = 10000              # vocabulary size
TRAIN_BATCH_SIZE = 20           # batch size for training data
TRAIN_NUM_STEP = 35             # truncation length for training data

EVAL_BATCH_SIZE = 1             # batch size for evaluation and test data
EVAL_NUM_STEP = 1               # truncation length for evaluation and test data
NUM_EPOCH = 5                   # number of passes over the training data
LSTM_KEEP_PROB = 0.9            # probability that an LSTM output is kept (not dropped out)
EMBEDDING_KEEP_PROB = 0.9       # probability that a word embedding is kept (not dropped out)
MAX_GRAD_NORM = 5               # upper bound on the gradient norm, to control gradient explosion
SHARE_EMB_AND_SOFTMAX = True    # share parameters between the softmax layer and the embedding layer



'''
@ brief: The model is wrapped in a PTBModel class, which makes it easier to maintain the
         recurrent network's state across batches.
'''
class PTBModel(object):
    '''
    @ brief: Build the LSTM-based RNN model and define its training procedure
    @ return: None
    @ param is_training: whether the model is being built for training
    @ param batch_size: batch size of the input data
    @ param num_steps: truncation length of the input data
    '''
    def __init__(self, is_training, batch_size, num_steps):
        # Record the batch size and the truncation length
        self.batch_size = batch_size
        self.num_steps = num_steps

        # Define the input and the expected output for each step; both have shape [batch_size, num_steps]
        self.input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
        self.targets = tf.placeholder(tf.int32, [batch_size, num_steps])

        # Define a deep RNN that uses LSTM cells with dropout as its recurrent structure
        dropout_keep_prob = LSTM_KEEP_PROB if is_training else 1.0
        lstm_cells = [
            tf.nn.rnn_cell.DropoutWrapper(
                tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE),
                output_keep_prob=dropout_keep_prob
            )
            for _ in range(NUM_LAYERS)
        ]
        cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)

        # Initialize the state to all zeros. This is only used for the first batch of each epoch
        self.initial_state = cell.zero_state(batch_size, tf.float32)

        # Define the word-embedding matrix
        embedding = tf.get_variable("embedding", [VOCAB_SIZE, HIDDEN_SIZE])

        # Convert the input word IDs into word embeddings
        inputs = tf.nn.embedding_lookup(embedding, self.input_data)

        # Apply dropout to the embeddings only during training
        if is_training:
            inputs = tf.nn.dropout(inputs, EMBEDDING_KEEP_PROB)

        # Define the output layer. The LSTM outputs at all time steps are collected first
        # and then fed to the softmax layer together
        outputs = []
        state = self.initial_state
        with tf.variable_scope("RNN"):
            for time_step in range(num_steps):
                if time_step > 0: tf.get_variable_scope().reuse_variables()
                cell_output, state = cell(inputs[:, time_step, :], state)
                outputs.append(cell_output)
        # Flatten the list of outputs into shape [batch, hidden_size * num_steps],
        # then reshape it into [batch * num_steps, hidden_size]
        output = tf.reshape(tf.concat(outputs, 1), [-1, HIDDEN_SIZE])

        # Softmax layer: convert the RNN output at every position into logits over the vocabulary
        if SHARE_EMB_AND_SOFTMAX:
            weight = tf.transpose(embedding)
        else:
            weight = tf.get_variable("weight", [HIDDEN_SIZE, VOCAB_SIZE])
        bias = tf.get_variable("bias", [VOCAB_SIZE])
        logits = tf.matmul(output, weight) + bias

        # Define the cross-entropy loss and the average cost
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels = tf.reshape(self.targets, [-1]),
            logits = logits
        )
        self.cost = tf.reduce_sum(loss) / batch_size
        self.final_state = state

        # Define the back-propagation ops only when training the model
        if not is_training: return

        trainable_variables = tf.trainable_variables()
        # Clip the gradients, then define the optimizer and the training step
        grads, _ = tf.clip_by_global_norm(
            tf.gradients(self.cost, trainable_variables),
            MAX_GRAD_NORM
        )
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        self.train_op = optimizer.apply_gradients(zip(grads, trainable_variables))



'''
@ brief: Run train_op on the batched data with the given model and return the perplexity over the whole dataset
@ return: the updated step counter and the model's perplexity on the given data
@ param session: TensorFlow session
@ param model: the model
@ param batches: the batched data
@ param train_op: the op to run for optimization (tf.no_op() for evaluation)
@ param output_log: whether to print progress logs
@ param step: the current global step counter, used for logging
'''
def run_epoch(session, model, batches, train_op, output_log, step):
    # Auxiliary variables for computing the average perplexity
    total_costs = 0.0
    iters = 0
    state = session.run(model.initial_state)
    # Run one epoch over the data
    for x, y in batches:
        # Run train_op on the current batch and compute the loss.
        # The cross-entropy loss measures how much probability the model assigns to the correct next word
        cost, state, _ = session.run(
            [model.cost, model.final_state, train_op],
            {
                model.input_data: x,
                model.targets: y,
                model.initial_state: state
             }
        )
        total_costs += cost
        iters += model.num_steps

        # Print a log line only during training
        if output_log and step % 100 == 0:
            print("After %d steps, perplexity is %.3f" % (
                step, np.exp(total_costs / iters)
            ))
        step += 1

    # total_costs / iters is the average per-word cross-entropy, so its exponential
    # is the perplexity of the given model on the given data
    return step, np.exp(total_costs / iters)



'''
@ brief: Read the data file and return a list of word IDs
@ return: the list of word IDs
@ param file_path: path to the word-ID file
'''
def read_data(file_path):
    with open(file_path, "r") as fin:
        # Read the whole file into one long string
        id_string = ' '.join([line.strip() for line in fin.readlines()])
    id_list = [int(w) for w in id_string.split()]   # convert every word ID to an integer
    return id_list
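# For example (illustrative IDs only, not taken from the real files), a file whose first
# line is "95 12 7 2" contributes [95, 12, 7, 2] to the returned id_list.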


'''
@ brief: Split the long word-ID list into batches that the model can consume
@ return: a list of length num_batches, where each element is a pair of a data matrix and a label matrix
@ param id_list: the list of word IDs
@ param batch_size: the desired batch size
@ param num_step: the truncation length
'''
def make_batches(id_list, batch_size, num_step):
    # Compute the total number of batches; each batch contains batch_size * num_step words
    num_batches = (len(id_list) - 1) // (batch_size * num_step)

    # Arrange the data into a 2-D array of shape [batch_size, num_batches * num_step]
    data = np.array(id_list[: num_batches * batch_size * num_step])
    data = np.reshape(data, [batch_size, num_batches * num_step])
    # Split the data along the second dimension into num_batches batches and store them in a list
    data_batches = np.split(data, num_batches, axis=1)

    # Repeat the steps above, but with every position shifted one to the right.
    # This gives the next word the RNN has to predict at each step
    label = np.array(id_list[1: num_batches * batch_size * num_step + 1])
    label = np.reshape(label, [batch_size, num_batches * num_step])
    label_batches = np.split(label, num_batches, axis=1)
    # Return a list of length num_batches; each element is a pair of a data matrix and a label matrix
    return list(zip(data_batches, label_batches))
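# Worked example (my own illustration, not from the original post): with
# id_list = [0, 1, 2, 3, 4, 5, 6, 7], batch_size = 2 and num_step = 2:
#   num_batches = (8 - 1) // (2 * 2) = 1
#   data  = [[0, 1], [2, 3]]   (the first 4 IDs, reshaped to [batch_size, num_batches * num_step])
#   label = [[1, 2], [3, 4]]   (the same positions shifted right by one word)
# so make_batches returns [(data, label)]: one batch whose labels are the next words of the inputs.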



'''
@ brief: Main function: train and evaluate the model
@ return: None
'''
def main():
    # Define the initializer for the model variables
    initializer = tf.random_uniform_initializer(-0.05, 0.05)

    # Define the RNN model used for training
    with tf.variable_scope("language_model",
                           reuse=None, initializer=initializer):
        train_model = PTBModel(True, TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)

    # Define the RNN model used for evaluation. It shares parameters with train_model but uses no dropout
    with tf.variable_scope("language_model",
                           reuse=True, initializer=initializer):
        eval_model = PTBModel(False, EVAL_BATCH_SIZE, EVAL_NUM_STEP)

    # Train the model
    with tf.Session() as session:
        tf.global_variables_initializer().run()
        train_batches = make_batches(
            read_data(TRAIN_DATA), TRAIN_BATCH_SIZE, TRAIN_NUM_STEP
        )
        eval_batches = make_batches(
            read_data(EVAL_DATA), EVAL_BATCH_SIZE, EVAL_NUM_STEP
        )
        test_batches = make_batches(
            read_data(TEST_DATA), EVAL_BATCH_SIZE, EVAL_NUM_STEP
        )

        step = 0
        for i in range(NUM_EPOCH):
            print("In iteration: %d" % (i + 1))
            step, train_pplx = run_epoch(session, train_model, train_batches,
                                         train_model.train_op, True, step)
            print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_pplx))

            _, eval_pplx = run_epoch(session, eval_model, eval_batches,
                                     tf.no_op(), False, 0)
            print("Epoch: %d Eval Perplexity: %.3f" % (i + 1, eval_pplx))

        _, test_pplx = run_epoch(session, eval_model, test_batches,
                                 tf.no_op(), False, 0)
        print("Test Perplexity: %.3f" % test_pplx)



if __name__ == "__main__":
    main()



''' 
In iteration: 1
After 0 steps, perplexity is 10075.574
After 100 steps, perplexity is 1747.968
After 200 steps, perplexity is 1177.165
After 300 steps, perplexity is 928.727
After 400 steps, perplexity is 763.889
After 500 steps, perplexity is 651.727
After 600 steps, perplexity is 576.014
After 700 steps, perplexity is 517.666
After 800 steps, perplexity is 466.301
After 900 steps, perplexity is 428.511
After 1000 steps, perplexity is 401.254
After 1100 steps, perplexity is 373.934
After 1200 steps, perplexity is 352.589
After 1300 steps, perplexity is 332.118
Epoch: 1 Train Perplexity: 328.882
Epoch: 1 Eval Perplexity: 186.893
In iteration: 2
After 1400 steps, perplexity is 177.436
After 1500 steps, perplexity is 163.493
After 1600 steps, perplexity is 166.173
After 1700 steps, perplexity is 163.144
After 1800 steps, perplexity is 158.316
After 1900 steps, perplexity is 156.327
After 2000 steps, perplexity is 154.582
After 2100 steps, perplexity is 149.846
After 2200 steps, perplexity is 146.841
After 2300 steps, perplexity is 145.581
After 2400 steps, perplexity is 143.259
After 2500 steps, perplexity is 140.378
After 2600 steps, perplexity is 136.973
Epoch: 2 Train Perplexity: 136.399
Epoch: 2 Eval Perplexity: 133.243
In iteration: 3
After 2700 steps, perplexity is 119.556
After 2800 steps, perplexity is 105.467
After 2900 steps, perplexity is 112.324
After 3000 steps, perplexity is 110.218
After 3100 steps, perplexity is 109.081
After 3200 steps, perplexity is 109.015
After 3300 steps, perplexity is 108.384
After 3400 steps, perplexity is 106.450
After 3500 steps, perplexity is 104.537
After 3600 steps, perplexity is 104.121
After 3700 steps, perplexity is 103.996
After 3800 steps, perplexity is 102.056
After 3900 steps, perplexity is 100.190
Epoch: 3 Train Perplexity: 99.861
Epoch: 3 Eval Perplexity: 116.491
In iteration: 4
After 4000 steps, perplexity is 99.503
After 4100 steps, perplexity is 83.888
After 4200 steps, perplexity is 89.210
After 4300 steps, perplexity is 89.028
After 4400 steps, perplexity is 88.194
After 4500 steps, perplexity is 87.713
After 4600 steps, perplexity is 87.379
After 4700 steps, perplexity is 86.564
After 4800 steps, perplexity is 85.171
After 4900 steps, perplexity is 84.778
After 5000 steps, perplexity is 85.033
After 5100 steps, perplexity is 83.678
After 5200 steps, perplexity is 82.771
After 5300 steps, perplexity is 82.330
Epoch: 4 Train Perplexity: 82.312
Epoch: 4 Eval Perplexity: 109.871
In iteration: 5
After 5400 steps, perplexity is 73.452
After 5500 steps, perplexity is 74.664
After 5600 steps, perplexity is 77.932
After 5700 steps, perplexity is 75.798
After 5800 steps, perplexity is 74.666
After 5900 steps, perplexity is 74.827
After 6000 steps, perplexity is 74.854
After 6100 steps, perplexity is 73.581
After 6200 steps, perplexity is 73.301
After 6300 steps, perplexity is 73.792
After 6400 steps, perplexity is 73.176
After 6500 steps, perplexity is 72.492
After 6600 steps, perplexity is 71.634
Epoch: 5 Train Perplexity: 71.809
Epoch: 5 Eval Perplexity: 107.415
Test Perplexity: 103.890

Process finished with exit code 0

'''
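The script above only trains and evaluates the model; it never actually generates a continuation. As a rough sketch of how the trained variables could be used to predict the next word of a given prefix (which is what the title refers to), something like the following could work. This is my own addition: it assumes the script is extended to save a checkpoint (the path below is hypothetical) and that PTBModel additionally exposes the per-step word distribution, e.g. by adding self.probs = tf.nn.softmax(logits) at the end of __init__.

import numpy as np
import tensorflow as tf

def predict_next_word(session, model, prefix_ids):
    """Feed a prefix of word IDs one step at a time (batch_size=1, num_steps=1),
    carrying the LSTM state, and return the ID of the most probable next word."""
    state = session.run(model.initial_state)
    probs = None
    for word_id in prefix_ids:
        probs, state = session.run(
            [model.probs, model.final_state],   # model.probs is the assumed softmax output
            {model.input_data: [[word_id]], model.initial_state: state}
        )
    return int(np.argmax(probs[0]))

# Usage sketch (hypothetical checkpoint path and prefix IDs):
# with tf.variable_scope("language_model", reuse=True):
#     predict_model = PTBModel(False, 1, 1)
# saver = tf.train.Saver()
# with tf.Session() as sess:
#     saver.restore(sess, "./ptb_model.ckpt")
#     next_id = predict_next_word(sess, predict_model, [42, 7, 128])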

The result is better than expected. Judging from the perplexity, before training the model was effectively picking one word out of a vocabulary of 10,000+, so the chance of guessing right was roughly one in ten thousand; after training, the test perplexity drops to about 104, which corresponds to roughly a one-in-a-hundred chance of guessing the next word correctly.
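To make that interpretation concrete: run_epoch returns np.exp(total_costs / iters), the exponential of the average per-word cross-entropy, so a perplexity of P means the model is on average as uncertain as if it were choosing uniformly among about P words. A tiny sanity check (my own illustration):

import numpy as np

# If the model assigned the correct next word probability p at every step, the average
# cross-entropy would be -log(p) and the perplexity exp(-log(p)) = 1 / p.
p = 1.0 / 104.0                    # roughly the test perplexity reported above
print(np.exp(-np.log(p)))          # ~104.0, i.e. about a 1% average chance per word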
