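NLP application: text generation with a GRU for story continuation (feed the model a passage and it generates what comes next, much like a chit-chat bot); one of the most challenging tasks in NLP (temperature parameter; sampling predictions from the predicted distribution)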

u013250861

Tags: artificial intelligence, deep learning, natural language processing, NLP, text generation

Article link: https://blog.csdn.net/u013250861/article/details/114296984

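This article walks through a text generation task using a GRU model: how to train the model on a dataset of Shakespeare's plays, and how to use it to generate new text in Shakespeare's style. The steps cover data preprocessing, building and training the model, and applying the model.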

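This is a text generation task using a GRU model. Text generation is one of the most challenging tasks in NLP.

Given a piece of text or a few characters as input, the model predicts the text that is likely to follow. We want the generated text to be grammatical and semantically coherent.

To date, text generation remains a hard problem in NLP.

In practice, text generation models are mostly tried on tasks involving artistic or literary text. They see little use in areas such as scientific writing or press releases, because their output is not yet rigorous enough.

This example uses Shakespeare's plays as the raw data.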

I. The Shakespeare dataset

Download URL: https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt

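Preview (the excerpt below appears to be sample text generated by a trained model in the style of the dataset rather than a verbatim passage; the file's actual opening lines are printed in section 1.2):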

QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m

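II. Steps for text generation with a GRU model

Step 1: download the dataset and preprocess the text
Step 2: build, train, and save the model
Step 3: use the model to generate text

1. Step 1: download the dataset and preprocess the text

1.1 Download the data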

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
import time

print("Tensorflow Version:", tf.__version__)  # print the TensorFlow version
# I. Download the dataset
# 1. Download the data
# tf.keras.utils.get_file downloads the file from the given URL and returns the local path of the raw data
path_to_file = tf.keras.utils.get_file(fname='shakespeare.txt', cache_dir='./', origin='https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
print("path_to_file = {0}".format(path_to_file))

Output:

Tensorflow Version: 2.1.0-rc2
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1122304/1115394 [==============================] - 0s 0us/step

1.2 Read the data

text = open(path_to_file, 'rb').read().decode(encoding='utf-8')  # open the raw data file and read its contents as text
print("text[:250] = \n{0}".format(text[:250]))
print('Total number of characters: len(text) = {0}'.format(len(text)))  # count the characters
vocab = sorted(set(text))  # sorted list of the distinct characters in the text
print("vocab = {0}".format(vocab))
print('Number of distinct characters = {0}'.format(len(vocab)))
print("-" * 200)

Output:

text[:250] = 
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
Total number of characters: len(text) = 1115394
vocab = ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of distinct characters = 65

1.3 Map the text to numbers

# Map characters to numbers by building two lookup tables: character -> index and index -> character
char2idx = {item: index for index, item in enumerate(vocab)}
print("char2idx = {0}".format(char2idx))
idx2char = np.array(vocab)
print("idx2char = {0}".format(idx2char))
text_as_int = np.array([char2idx[c] for c in text])  # represent the whole text as character indices
print("text_as_int = {0}".format(text_as_int))
print('Characters mapped to int：{} ---- > {}'.format(repr(text[:13]), text_as_int[:13]))  # show how the first 13 characters of the corpus are mapped
print("-" * 200)

Output:

char2idx = {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
idx2char = ['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'
 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'
 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
text_as_int = [18 47 56 ... 45  8  0]
Characters mapped to int：'First Citizen' ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]
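
The two lookup tables are inverses of each other, which a quick round trip can confirm. The sketch below is only an illustration: it rebuilds the tables for a short sample string instead of the full corpus, uses a plain Python list where the code above uses a NumPy array, and the names sample_text, encode, and decode are hypothetical helpers that do not appear in the original code.

```python
# Hypothetical sample text standing in for the Shakespeare corpus.
sample_text = "First Citizen:\nSpeak, speak."

# Same construction as above, on the small sample.
vocab = sorted(set(sample_text))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = list(vocab)  # a plain list instead of np.array(vocab)


def encode(s):
    """Map a string to a list of character indices."""
    return [char2idx[c] for c in s]


def decode(ids):
    """Map a list of character indices back to a string."""
    return "".join(idx2char[i] for i in ids)


encoded = encode("speak")
decoded = decode(encoded)
print(encoded, repr(decoded))
```

Any character that is missing from the vocabulary raises a KeyError in encode, so text passed to the model later has to be drawn from the same character set as the training corpus.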

1.4 Build the training data

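For the raw text, we choose an input sequence length seq_length by hand. Each input sequence has a target sequence of the same length, shifted one character to the right. For example, with seq_length set to 4 and the text "hello", the training pair is the input sequence "hell" and the target sequence "ello".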
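
The shift-by-one idea can be tried on a toy string before bringing in tf.data. The following is a minimal pure-Python sketch, not the pipeline used in this article: make_chunks is a hypothetical helper that mimics cutting the text into chunks of seq_length + 1 characters and discarding a short final chunk, and split_input_target then produces the input and target for each chunk.

```python
def make_chunks(text, seq_length):
    """Cut text into consecutive chunks of seq_length + 1 characters,
    discarding a final chunk that is shorter than that."""
    size = seq_length + 1
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]


def split_input_target(chunk):
    """Input is the chunk without its last character; target is the chunk
    without its first character, i.e. the input shifted by one position."""
    return chunk[:-1], chunk[1:]


pairs = [split_input_target(c) for c in make_chunks("hello world", seq_length=4)]
for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
# 'hell' -> 'ello'
# ' wor' -> 'worl'
```

The trailing "d" does not fill a whole chunk and is dropped. The same integer division applied to the corpus, 1115394 // 101, gives 11043 chunks when seq_length is 100.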

seq_length = 100  # input sequence length (characters per training example)
# Convert the integer-encoded text into a Dataset for later processing. from_tensor_slices slices its input along the first dimension, so each element is one character index. <class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
print("len(char_dataset) = {0}".format(len(char_dataset)))  # 1115394 (len() on a Dataset requires a recent TensorFlow release)
for i in char_dataset.take(5):  # look at the first 5 elements and map each index back to its character
    print("char_dataset index {0} -> character {1}".format(i.numpy(), idx2char[i.numpy()]))
# Group the characters into chunks of seq_length + 1 with batch (the extra character leaves room for the one-step shift). drop_remainder=True discards a final chunk that is shorter than the chunk size. <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
sequence_batches = char_dataset.batch(seq_length + 1, drop_remainder=True)
print("Number of chunks in sequence_batches: len(sequence_batches) = {0}".format(len(sequence_batches)))  # 11043 = 1115394 // 101
for item in sequence_batches.take(1):
    print("item = {0}".format(item))
    print("item.numpy() = {0}".format(item.numpy()))
    print("idx2char[item.numpy()] = {0}".format(idx2char[item.numpy()]))
    print("repr(''.join(idx2char[item.numpy()])) = {0}".format(repr(''.join(idx2char[item.numpy()]))))


def split_input_target(chunk):  # split a chunk into input and target: the first 100 characters are the input, the second character through the last are the target
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text


dataset_train = sequence_batches.map(split_input_target)  # apply the function to every chunk with map
print("dataset_train = {0}".format(dataset_train))
print("-" * 200)

for input_example, target_example in dataset_train.take(1):  # inspect the first example after splitting
    print('Input text: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target text:', repr(''.join(idx2char[target_example.numpy()])))

for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):  # show the input and expected output at each time step fed to the model (first five steps as an example)
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
print("-" * 200)

Output:

len(char_dataset) = 1115394
char_dataset index 18 -> character F
char_dataset index 47 -> character i
char_dataset index 56 -> character r
char_dataset index 57 -> character s
char_dataset index 58 -> character t

Number of chunks in sequence_batches: len(sequence_batches) = 11043

item = [18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1]

item.numpy() = [18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49