Loading Text with tf.data

ynchyong

Category: Machine Learning. Tags: tensorflow, loading text

Article link: https://blog.csdn.net/ynchyong/article/details/109852966

Code

# -*- coding: utf-8 -*-
"""
Created on  2020/11/20 15:31
@Author: CY
@email: 5844104706@qq.com
"""
import tensorflow as tf

import tensorflow_datasets as tfds
import os
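# Goal: take three different English translations of the same work (Homer's Iliad) and train a model that identifies the translator from a single line of text.

# Example of loading text files with tf.data.TextLineDataset, which is typically used to build a dataset from a text file (one line of the original file is one example).
# Dataset: the text files have already had some typical preprocessing applied, mainly removing document headers and footers, line numbers, and chapter titles.
# Download these lightly modified files.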
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = os.path.dirname(text_dir)

print("Data directory:", parent_dir)


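# Iterate over the files, loading each one into its own dataset.
# Each example needs to be labeled individually, so use tf.data.Dataset.map to apply a labeler function to each one. This iterates over every example in the dataset and returns (example, label) pairs.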
def labeler(example, index):
    return example, tf.cast(index, tf.int64)


labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

# Use tf.data.Dataset.take and print to inspect a few (example, label) pairs. The numpy field of each printed Tensor shows its value.
for ex in all_labeled_data.take(5):
    print(ex)

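print("Encoding text as numbers")
# Machine learning models work on numbers, not text, so each string has to be converted into a list of numbers. To do that, map each unique word to a unique integer.
# Build the vocabulary by tokenizing the text into a collection of individual unique words. There are several ways to do this in both TensorFlow and Python:
#### Iterate over each example's numpy value.
#### Use tfds.features.text.Tokenizer (tfds.deprecated.text.Tokenizer in newer releases) to split it into tokens.
#### Collect these tokens into a Python set to remove duplicates.
#### Get the size of the vocabulary for later use.

# tokenizer = tfds.features.text.Tokenizer()  # this API path has been deprecated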
tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
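print('Vocabulary size:', vocab_size)

print("# Encoding an example")
# Build an encoder by passing vocabulary_set to tfds.features.text.TokenTextEncoder.
# The encoder's encode method takes a line of text and returns a list of integers.
# encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)  # this API path has been deprecated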
encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set)
example_text = next(iter(all_labeled_data))[0].numpy()
print('Original:', example_text)
encoded_example = encoder.encode(example_text)
print('Encoded:', encoded_example)

print("Running the encoder over the dataset (by wrapping it in tf.py_function and passing that to the dataset's map method)")
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    # py_func doesn't set the shape of the returned tensors.
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))

    # `tf.data.Datasets` work best if all components have a shape set
    #  so set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label


all_encoded_data = all_labeled_data.map(encode_map_fn)
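print("# Splitting the dataset into test and training sets and batching")
# Use tf.data.Dataset.take and tf.data.Dataset.skip to create a small test dataset and a larger training dataset.
# Before being passed to the model, the datasets need to be batched. Typically, the examples inside a batch need to have the same size and shape.
# But the examples in this dataset are not all the same size (each line has a different number of words), so use tf.data.Dataset.padded_batch (instead of batch) to pad the examples to the same size.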
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)
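# Now test_data and train_data are not collections of (example, label) pairs but collections of batches.
# Each batch is a pair of (many examples, many labels) represented as arrays.
sample_text, sample_labels = next(iter(test_data))

# In a notebook the bare expression below displays the first example and its label; when run as a script it prints nothing.
sample_text[0], sample_labels[0]
# Padding introduces a new token id (the zero used for padding), so the vocabulary size grows by one.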
vocab_size += 1

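print("Building the model:")
model = tf.keras.Sequential()
# The first layer converts integer representations to dense vector embeddings. See the Word Embeddings tutorial for more details.
model.add(tf.keras.layers.Embedding(vocab_size, 64))
# The next layer is an LSTM, which lets the model understand words in the context of the other words. The bidirectional wrapper on the LSTM helps it learn about each data point in relation to the data points before and after it.
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))
# Finally come one or more densely connected layers, the last of which is the output layer. The output layer produces a probability for each label, and the label with the highest probability is the model's prediction.
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.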
for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3, activation='softmax'))

# Finally, compile the model. For a softmax classification model,
# sparse_categorical_crossentropy is the usual loss function. You can try other optimizers,
# but adam is a very common choice.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("# Training the model")
# Trained on this data, the model reaches decent accuracy (roughly 83%).
model.fit(train_data, epochs=3, validation_data=test_data)
eval_loss, eval_acc = model.evaluate(test_data)
print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
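As an aside on the batching step in the script above: the following is a minimal pure-Python sketch of the idea behind padded batching, not the tf.data implementation. It assumes that id 0 is reserved for padding and that each batch is padded only to the length of its own longest sequence. The names `PAD_ID`, `pad_batch`, and `batches` are hypothetical and exist only for this illustration.

```python
from typing import Iterator, List, Sequence

PAD_ID = 0  # assumption: 0 is never used as a real token id


def pad_batch(sequences: Sequence[Sequence[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


def batches(encoded_lines: Sequence[Sequence[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Yield consecutive padded batches of at most batch_size sequences."""
    for start in range(0, len(encoded_lines), batch_size):
        yield pad_batch(encoded_lines[start:start + batch_size])


# Four encoded "lines" of different lengths, grouped two per batch.
encoded_lines = [[5, 2, 9], [7], [3, 8], [4, 4, 4, 4]]
padded = list(batches(encoded_lines, batch_size=2))
```

Because the padding value occupies an id of its own, the embedding layer needs one more row than there are real tokens, which is the reason the script increments `vocab_size` after batching.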

Output of running the script



Data directory: C:\Users\Administrator\.keras\datasets
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(<tf.Tensor: shape=(), dtype=string, numpy=b"May bury him, and to his mem'ry raise">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'The reins attaching to the chariot-rail,'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'And brisk in fight Oresbius; rich was he,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'The son of Phylacus; these two in arms'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Patroclus: whom I never can forget,'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
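Encoding text as numbers
Vocabulary size: 17178
# Encoding an example
Original: b"May bury him, and to his mem'ry raise"
Encoded: [1837, 2380, 16078, 12233, 2267, 7486, 7893, 17003, 1487]
Running the encoder over the dataset (by wrapping it in tf.py_function and passing that to the dataset's map method)
# Splitting the dataset into test and training sets and batching
Building the model:
# Training the model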
Epoch 1/3
697/697 [==============================] - 21s 31ms/step - loss: 0.5098 - accuracy: 0.7559 - val_loss: 0.3877 - val_accuracy: 0.8248
Epoch 2/3
697/697 [==============================] - 19s 28ms/step - loss: 0.2940 - accuracy: 0.8698 - val_loss: 0.3612 - val_accuracy: 0.8364
Epoch 3/3
697/697 [==============================] - 18s 26ms/step - loss: 0.2186 - accuracy: 0.9045 - val_loss: 0.3686 - val_accuracy: 0.8418
79/79 [==============================] - 1s 17ms/step - loss: 0.3686 - accuracy: 0.8418

Eval loss: 0.36862486600875854, Eval accuracy: 0.8417999744415283
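
The encoded line in the output above has nine integers for eight whitespace-separated words, which is consistent with the tokenizer splitting mem'ry at the apostrophe. The sketch below imitates the vocabulary-building and encoding steps in plain Python to make that visible. It is an approximation under stated assumptions, not the tfds implementation: `tokenize`, `token_to_id`, and `encode` are hypothetical helpers, the regex split only approximates the real tokenizer, and ids are assigned in sorted order starting at 1 (so 0 stays free for padding), whereas the script's actual ids depend on the encoder.

```python
import re
from typing import Dict, List, Set


def tokenize(line: str) -> List[str]:
    """Split on runs of non-word characters and drop empty strings."""
    return [tok for tok in re.split(r"\W+", line) if tok]


# Two lines taken from the sample output of the script.
lines = [
    "May bury him, and to his mem'ry raise",
    "The reins attaching to the chariot-rail,",
]

# Collect tokens into a set to remove duplicates.
vocabulary_set: Set[str] = set()
for line in lines:
    vocabulary_set.update(tokenize(line))

# Assumption: ids start at 1 and follow sorted order so the mapping is reproducible.
token_to_id: Dict[str, int] = {
    tok: idx for idx, tok in enumerate(sorted(vocabulary_set), start=1)
}


def encode(line: str) -> List[int]:
    """Map each token of a line to its integer id."""
    return [token_to_id[tok] for tok in tokenize(line)]


encoded = encode(lines[0])
print(tokenize(lines[0]))
print(len(vocabulary_set), len(encoded))
```

In this sketch "to" appears in both lines but is counted once in the vocabulary, which is the deduplication the script relies on when it builds `vocabulary_set`.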
