Text Generation with an RNN (Colab TensorFlow Example, Translated)

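This tutorial shows how to generate text with a character-based RNN model trained on a dataset of Shakespeare's works. After training, the model can predict the next character in a sequence; the output is structured like a play, but most of it is meaningless. The tutorial covers the full workflow, from data preprocessing to model training and text generation.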


Text generation with an RNN

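Link to the original English version
This tutorial demonstrates how to generate text using a character-based RNN. We will work with a dataset of Shakespeare's writing from Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks". Given a sequence of characters from this data ("Shakespear"), the model is trained to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly.
This tutorial includes runnable code implemented using tf.keras and eager execution. The following is sample output when the model in this tutorial was trained for 30 epochs and started with the string "Q":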

QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m

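While some of the sentences are grammatical, most do not make sense. The model has not learned the meaning of words, but consider:

The model is character-based. When training started, the model did not know how to spell an English word, or even that words were a unit of text.
The structure of the output resembles a play: blocks of text generally begin with a speaker name, in all capital letters as in the dataset.
As demonstrated below, the model is trained on small batches of text (100 characters each), and is still able to generate a longer sequence of text with coherent structure.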

Setup

Import TensorFlow and other libraries

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import numpy as np
import os
import time

Download the Shakespeare dataset

Change the file name in the following line as needed to run this code on your own data.

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1122304/1115394 [==============================] - 0s 0us/step

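Read the data

First, look at the text:

# Read the file as bytes, then decode it as UTF-8
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# The length of the text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))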

Length of text: 1115394 characters

# Take a look at the first 250 characters of the text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

# The unique (non-repeating) characters in the text, collected with a set and sorted
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))

65 unique characters

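Process the text

Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another mapping numbers to characters.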

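# Map each unique character to an index, and each index back to its character
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

# Convert the whole text to its integer representation
text_as_int = np.array([char2idx[c] for c in text])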

Now each character has an integer representation. Note that the characters are mapped to indices from 0 to len(vocab) - 1.

print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

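Note: repr() converts an object into a form that the interpreter can read, which is why the newline character is displayed as '\n' below.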

{
  '&' :   4,
  'E' :  17,
  't' :  58,
  'x' :  62,
  'F' :  18,
  'T' :  32,
  'C' :  15,
  'V' :  34,
  'z' :  64,
  '-' :   7,
  'd' :  42,
  '\n':   0,
  'u' :  59,
  'Y' :  37,
  'p' :  54,
  ' ' :   1,
  'q' :  55,
  'P' :  28,
  'o' :  53,
  'm' :  51,
  ...
}

'First Citizen' ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]
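
The two lookup tables can be checked by encoding a short string and decoding it again. The following is a minimal sketch and not part of the original tutorial: the sample_* names are introduced here for illustration, and because the vocabulary is built from the sample string alone, the indices differ from those printed above.

```python
import numpy as np

# Assumption: a tiny stand-in text instead of the full Shakespeare corpus
sample_text = "First Citizen"

# Build the two lookup tables the same way as for the full text
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = np.array(sample_vocab)

# Encode: characters -> integer indices
encoded = np.array([sample_char2idx[c] for c in sample_text])

# Decode: integer indices -> characters, joined back into a string
decoded = "".join(sample_idx2char[encoded])

print(encoded)
print(repr(decoded))
```

If the round trip returns the original string, the two tables are consistent with each other.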

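The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we are training the model to perform. The input to the model will be a sequence of characters, and we train the model to predict the output: the following character at each time step.

Since RNNs maintain an internal state that depends on the previously seen elements, the question becomes: given all the characters computed until this moment, what is the next character?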
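
To make the idea of an internal state concrete, here is a minimal NumPy sketch of a plain (Elman-style) recurrent step with random, untrained weights. It is an illustration only: the names rnn_step and run_rnn, the hidden size, and the weight initialization are assumptions made for this example, and this is not the model trained in this tutorial.

```python
import numpy as np

def rnn_step(x_onehot, h_prev, W_x, W_h, b):
    """One step of a plain RNN: the new state mixes the current input with the old state."""
    return np.tanh(W_x @ x_onehot + W_h @ h_prev + b)

def run_rnn(indices, vocab_size, hidden_size=8, seed=0):
    """Feed a sequence of character indices through the RNN and return the final state."""
    rng = np.random.default_rng(seed)  # same seed -> same (untrained) weights on every call
    W_x = rng.normal(scale=0.5, size=(hidden_size, vocab_size))
    W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    b = np.zeros(hidden_size)
    h = np.zeros(hidden_size)  # initial state: nothing has been seen yet
    for idx in indices:
        x = np.zeros(vocab_size)
        x[idx] = 1.0  # one-hot encoding of the current character
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Two sequences that end in the same character but start differently
state_a = run_rnn([0, 2], vocab_size=3)
state_b = run_rnn([1, 2], vocab_size=3)

# The final states differ because each one carries information about the earlier character
print(np.allclose(state_a, state_b))
```

Both sequences end with the same character, yet the final states are different, which is what lets a recurrent model condition its next-character prediction on everything it has seen so far.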

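Create training examples and targets

Next, divide the text into example sequences. Each input sequence will contain seq_length characters from the text.

For each input sequence, the corresponding target contains the same length of text, except shifted one character to the right.

So, break the text into chunks of seq_length+1 characters. For example, say seq_length is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".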
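
The "Hello" example can be reproduced with plain Python strings. This is a small sketch under stated assumptions: the helper names make_chunks and split_input_target are chosen here for illustration, and an incomplete trailing chunk is simply dropped.

```python
def make_chunks(text, seq_length):
    """Cut text into consecutive chunks of seq_length + 1 characters, dropping any remainder."""
    size = seq_length + 1
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]

def split_input_target(chunk):
    """The input is the chunk without its last character; the target is shifted right by one."""
    return chunk[:-1], chunk[1:]

for chunk in make_chunks("Hello", seq_length=4):
    input_seq, target_seq = split_input_target(chunk)
    print(repr(input_seq), '->', repr(target_seq))
```

At every position the target character is the one that follows the corresponding input character, so a single chunk of seq_length+1 characters yields seq_length training predictions.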

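To build these chunks from the full text, first use the tf.data.Dataset.from_tensor_slices function to convert the text vector into a stream of character indices.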

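# The maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create a dataset of character indices from the text vector
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# Print the first five elements of the dataset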
for i in char_dataset.take(5):