TensorFlow 2.0 Study Notes: Loading and Preprocessing Data (Part 7)

赫凯

Article link: https://blog.csdn.net/u010095372/article/details/124519459

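This article describes in detail how to handle several kinds of data with TensorFlow, including loading, formatting, and preprocessing image data, reading and preprocessing CSV data, and processing text data. Worked examples show how to build datasets and models and train them, and TFRecord is introduced as a way to store and read data efficiently.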


Images

Official image tutorial

Imports

import tensorflow as tf
AUTOTUNE = tf.data.experimental.AUTOTUNE

Data preparation

Download the data

import pathlib
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)

Inspect the data and shuffle the paths

for item in data_root.iterdir():
    print(item)

import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
print(image_count)

View the images

The original page used the IPython package, which only displays images inside Jupyter, so here I switched to matplotlib.

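import matplotlib.pyplot as plt  # plt is used to display the images
from PIL import Image

# `attributions` is used by caption_image below but was never defined here.
# The official tutorial builds it from LICENSE.txt in the dataset folder
# (assumed to be present in the extracted flower_photos directory).
attributions = (data_root / "LICENSE.txt").open(encoding='utf-8').readlines()[4:]
attributions = [line.split(' CC-BY') for line in attributions]
attributions = dict(attributions)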


def display(path):
    img = Image.open(path)
    plt.imshow(img)
    plt.show()


def caption_image(image_path):
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    return "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel).replace('\\', '/')].split(' - ')[:-1])


for n in range(3):
    image_path = random.choice(all_image_paths)
    display(image_path)
    print(caption_image(image_path))
    print()

To be fair, these flower photos are quite pretty.

Labels for the images

Print the label names of the images, then assign each one a numeric index.

label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
print(label_names)

label_to_index = dict((name, index) for index, name in enumerate(label_names))
print(label_to_index)

Then map every image to its label index.

all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]

print("First 10 labels indices: ", all_image_labels[:10])
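As a plain-Python sketch of the mapping above: the label of an image is simply the name of its parent directory. The paths below are made up for illustration and are not taken from the dataset; only the `<label>/<file>.jpg` layout matters.

```python
import pathlib

# Hypothetical paths; only the directory layout <label>/<file>.jpg matters.
example_paths = [
    "flower_photos/roses/a.jpg",
    "flower_photos/daisy/b.jpg",
    "flower_photos/roses/c.jpg",
]

# Sorted, de-duplicated directory names become the label names.
example_label_names = sorted(
    {pathlib.PurePosixPath(p).parent.name for p in example_paths})
# Each label name gets a stable integer index.
example_label_to_index = {
    name: index for index, name in enumerate(example_label_names)}
# Look up the index of every path's parent directory.
example_labels = [
    example_label_to_index[pathlib.PurePosixPath(p).parent.name]
    for p in example_paths]

print(example_label_names)     # ['daisy', 'roses']
print(example_label_to_index)  # {'daisy': 0, 'roses': 1}
print(example_labels)          # [1, 0, 1]
```

Sorting the names first keeps the indices the same from run to run, even though the paths themselves are shuffled.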

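We now have both the label array all_image_labels and the image path array all_image_paths.

Load and format the images

The functions below format an image: they resize it to a standard size and normalize it.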

def preprocess_image(image):
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize(image, [192, 192])
  image /= 255.0  # normalize to [0,1] range

  return image

def load_and_preprocess_image(path):
  image = tf.io.read_file(path)
  return preprocess_image(image)

import matplotlib.pyplot as plt

image_path = all_image_paths[0]
label = all_image_labels[0]

plt.imshow(load_and_preprocess_image(image_path))
plt.grid(False)
plt.xlabel(caption_image(image_path))
plt.title(label_names[label].title())
plt.show()

tf.data.Dataset

Let's look at the official loading tool.
First put all the image paths into a TensorSliceDataset, then use map to load and format the images on the fly.

path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)

import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
for n, image in enumerate(image_ds.take(4)):
  plt.subplot(2,2,n+1)
  plt.imshow(image)
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  plt.xlabel(caption_image(all_image_paths[n]))
plt.show()  # called once, after all four subplots have been drawn

If images can be handled this way, so can labels.

label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))
for label in label_ds.take(10):
  print(label_names[label.numpy()])

Then zip the two together, so that image_label_ds yields (image, label) pairs.

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

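Note: when you have arrays like all_image_labels and all_image_paths, an alternative to tf.data.Dataset.zip is to slice the pair of arrays.

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

# The tuples are unpacked into the positional arguments of the mapped function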
def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)
image_label_ds

Run it

Dataset settings for training

BATCH_SIZE = 32

# Use a shuffle buffer as large as the dataset so that the data
# is fully shuffled.
ds = image_label_ds.shuffle(buffer_size=image_count)  # shuffle the order of the data
ds = ds.repeat()  # repeat indefinitely
ds = ds.batch(BATCH_SIZE)  # split into batches
# `prefetch` lets the dataset fetch batches in the background while the model trains.
ds = ds.prefetch(buffer_size=AUTOTUNE)

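A shuffled dataset does not report the end of the dataset until its shuffle buffer is completely empty. The Dataset is then restarted by .repeat, which means waiting again for the shuffle buffer to fill.

This can be addressed by using the tf.data.Dataset.apply method together with the fused tf.data.experimental.shuffle_and_repeat function (later TensorFlow releases mark this function as deprecated in favor of shuffle followed by repeat):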

ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds
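To make the shuffle-buffer behavior more concrete, here is a simplified plain-Python model of a fixed-size shuffle buffer. It is only a sketch of the idea, not TensorFlow's implementation, and the name `buffered_shuffle` is made up for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Nothing is emitted until the buffer has been filled.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

# A buffer of size 1 cannot reorder anything.
print(list(buffered_shuffle(range(10), buffer_size=1)))
# A buffer as large as the data gives a full shuffle.
print(list(buffered_shuffle(range(10), buffer_size=10)))
```

In this model the first element can only come out once the buffer is full, which is why a large buffer makes a pipeline slow to start, and a buffer smaller than the dataset only mixes nearby elements.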

Feed the data into a model

A copy of MobileNetV2 is available directly from tf.keras.applications; here its weights are set to be non-trainable.

mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable=False

Check what input the network expects

help(tf.keras.applications.mobilenet_v2.preprocess_input)

The inputs need to be rescaled to the range [-1, 1]

def change_range(image,label):
  return 2*image-1, label

keras_ds = ds.map(change_range)
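The arithmetic behind change_range can be checked on a few plain numbers. This is a small sketch on Python floats rather than tensors; `to_minus_one_one` is a name chosen just for this example.

```python
def to_minus_one_one(x):
    # Map a value from the range [0, 1] to the range [-1, 1].
    return 2 * x - 1

for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", to_minus_one_one(pixel))
# 0.0 -> -1.0
# 0.5 -> 0.0
# 1.0 -> 1.0
```

The formula only gives [-1, 1] because the images were already divided by 255 in preprocess_image; applied to raw 0-255 pixels it would give a different range.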

Pull one batch to take a look first

# The dataset may take a few seconds to start, as it fills its shuffle buffer.
image_batch, label_batch = next(iter(keras_ds))
feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)

Result

(32, 6, 6, 1280)
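The shape above can be reproduced with simple arithmetic. The sketch below assumes, as far as I know, that MobileNetV2 downsamples its input by an overall factor of 32 and that its last convolutional block has 1280 channels at the default width; both constants are assumptions of this example rather than values read from the code above.

```python
batch_size = 32
input_size = 192
total_stride = 32      # assumption: overall downsampling factor of MobileNetV2
final_channels = 1280  # assumption: channels of the last block at default width

feature_map_shape = (
    batch_size,
    input_size // total_stride,
    input_size // total_stride,
    final_channels,
)
print(feature_map_shape)  # (32, 6, 6, 1280)
```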

Next, build our own model on top of it and look at the output

model = tf.keras.Sequential([
  mobile_net,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(len(label_names), activation = 'softmax')])

logit_batch = model(image_batch).numpy()

print("min logit:", logit_batch.min())
print("max logit:", logit_batch.max())
print()

print("Shape:", logit_batch.shape)

Result

min logit: 0.004120019
max logit: 0.6654783
Shape: (32, 5)
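Because the last Dense layer uses a softmax activation, the values printed as "logit" are really probabilities, which is why they all lie between 0 and 1. A minimal plain-Python softmax, with made-up input scores for five classes, shows the effect:

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, -1.0, 0.5, 0.0, -3.0])  # made-up scores for 5 classes
print(probs)       # every value is between 0 and 1
print(sum(probs))  # the values sum to 1 (up to rounding)
```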

Compile the model and train it briefly

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])
model.fit(ds, epochs=1, steps_per_epoch=3)

Result

3/3 [==============================] - 0s 165ms/step - loss: 2.0662 - accuracy: 0.1667
Out[37]: <tensorflow.python.keras.callbacks.History at 0x16f3f3d2ac0>

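A cache can be added to speed up training: the preprocessed images are reused across epochs, so the GPU does not have to wait for the CPU to prepare the data again before it can run.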

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)

Summary

We start with images and labels, where each image is represented by its file path. This guarantees one label per path, and the image is read through a mapping function only when it is needed.

# Put the image paths and the labels into separate tf.data.Dataset objects
path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

# image_ds and label_ds can be combined
image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

# Alternatively, build a tf.data.Dataset directly from the two arrays
ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

# The tuples are unpacked into the positional arguments of the mapped function
def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)


# Then configure the dataset for training
ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)

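CSV

Official CSV tutorial: reading CSV files, a plain-text tabular format that Excel can also open. Looking at the official examples, I found that the English and Chinese versions of the code are not the same, which is interesting.

Imports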

import functools

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

Data preparation

Download the data. This dataset contains passenger information for the Titanic, including who survived.

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

With the data downloaded, pass it to the dataset constructor tf.data.experimental.make_csv_dataset.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # set to a small value by hand so the example is easier to display
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Print it to take a look: each batch is one big dictionary whose keys map to arrays.

examples, labels = next(iter(raw_train_data)) # the first batch
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)

Result

EXAMPLES: 
 OrderedDict([
('sex', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'male', b'male', b'female', b'male', b'male', b'male', b'male',
       b'female', b'male', b'female', b'male', b'female'], dtype=object)>),
('age', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([25., 23., 28., 35., 28., 47., 35., 45., 19., 31., 29., 32.],
      dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0])>), 
('parch', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])>), 
('fare', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([ 13.   ,  63.358,   7.879, 512.329,   7.896,   9.   ,   7.125,
       164.867,  10.171, 113.275,  27.721,  13.   ], dtype=float32)>), 
('class', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Second', b'First', b'Third', b'First', b'Third', b'Third',
       b'Third', b'First', b'Third', b'First', b'Second', b'Second'],
      dtype=object)>), 
('deck', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'unknown', b'D', b'unknown', b'B', b'unknown', b'unknown',
       b'unknown', b'unknown', b'unknown', b'D', b'unknown', b'unknown'],
      dtype=object)>), 
('embark_town', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Queenstown', b'Cherbourg',
       b'Southampton', b'Southampton', b'Southampton', b'Southampton',
       b'Southampton', b'Cherbourg', b'Cherbourg', b'Southampton'],
      dtype=object)>), 
('alone', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'y', b'n', b'y', b'y', b'y', b'y', b'y', b'n', b'y', b'n', b'n',
       b'y'], dtype=object)>)]) 

LABELS: 
 tf.Tensor([0 1 1 1 0 0 0 1 0 1 0 1], shape=(12,), dtype=int32)

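Data preprocessing

Categorical data

My impression of tf.feature_column.indicator_column is that it turns a categorical column into a one-hot encoding. For example, the sex column contains male and female, so male is encoded as (1, 0) and female as (0, 1).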

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))
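The one-hot idea can be sketched in plain Python. This only illustrates the encoding and is not how the feature columns are implemented; `one_hot` is a name invented for this example, and returning all zeros for a value outside the vocabulary is a choice made in this sketch.

```python
def one_hot(value, vocabulary):
    # A 1 at the position of `value` in `vocabulary`, 0 everywhere else.
    # A value that is not in the vocabulary gives an all-zero vector here.
    return [1 if value == entry else 0 for entry in vocabulary]

sex_vocab = ['male', 'female']
print(one_hot('male', sex_vocab))    # [1, 0]
print(one_hot('female', sex_vocab))  # [0, 1]

deck_vocab = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
print(one_hot('unknown', deck_vocab))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The last line shows why the vocabulary has to match the strings in the data exactly: a value that is missing from the list, or spelled differently, never produces a 1.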

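Continuous floating-point data

Floating-point data needs to be normalized; otherwise the values in different columns differ too much in scale.

Now create a collection of numeric columns. The tf.feature_column.numeric_column API accepts a normalizer_fn argument. Pass it a function built with functools.partial, which binds each column's mean to the normalization function.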

def process_continuous_data(mean, data):
  # Normalize the data: a value equal to the mean maps to 0.5
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

numerical_columns = []

for feature in MEANS.keys():
  num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
  numerical_columns.append(num_col)
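How functools.partial binds the mean can be seen on plain floats. The sketch below repeats the arithmetic of process_continuous_data without tensors; `scale_by_mean` and `normalize_age` are names made up for this example.

```python
import functools

def scale_by_mean(mean, value):
    # Same arithmetic as process_continuous_data, on a plain float.
    return value * 1 / (2 * mean)

MEAN_AGE = 29.631308  # the 'age' entry of the MEANS table above
# partial fixes the first argument, leaving a one-argument function.
normalize_age = functools.partial(scale_by_mean, MEAN_AGE)

print(normalize_age(MEAN_AGE))  # 0.5
print(normalize_age(0.0))       # 0.0
```

normalizer_fn expects a function of one argument (the column data), which is why the mean has to be bound in advance.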

Model preparation

Preprocessing layer
tf.keras.layers.DenseFeatures turns the column data into a single Tensor.

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Run it

train_data = raw_train_data.shuffle(500)
test_data = raw_test_data

model.fit(train_data, epochs=20)

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))

predictions = model.predict(test_data)

# Show some of the results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))

NumPy

NumPy is an old friend from my PyTorch work, so it is very familiar.

Imports

import numpy as np
import tensorflow as tf

Data preparation

DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
  train_examples = data['x_train']
  train_labels = data['y_train']
  test_examples = data['x_test']
  test_labels = data['y_test']

The resulting training data has shape (60000, 28, 28), and it is also handled with tf.data.Dataset.from_tensor_slices.

train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))

BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
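As a quick check of what batch(64) produces, the number of batches per epoch follows from the dataset sizes. The sketch assumes the standard MNIST split of 60000 training and 10000 test images.

```python
import math

train_size, test_size, batch_size = 60000, 10000, 64  # assumed MNIST split

train_batches = math.ceil(train_size / batch_size)
test_batches = math.ceil(test_size / batch_size)
# Size of the final, partially filled batch.
last_train_batch = train_size - (train_batches - 1) * batch_size
last_test_batch = test_size - (test_batches - 1) * batch_size

print(train_batches, last_train_batch)  # 938 32
print(test_batches, last_test_batch)    # 157 16
```

This is the step count that model.fit reports per epoch, since batch keeps the smaller final batch by default.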

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])

Run it

model.fit(train_dataset, epochs=10)

model.evaluate(test_dataset)

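The data is well structured, so this is straightforward.

pandas dataframes

Official Chinese pandas DataFrame example: the data is manipulated with the third-party pandas library.

Imports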

import pandas as pd
import tensorflow as tf

Data preparation

The link on the official site no longer works, so I replaced it with a working one: https://storage.googleapis.com/tf-datasets/titanic/train.csv

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/tf-datasets/titanic/train.csv')
df = pd.read_csv(csv_file)
print(df.head())
print(df.dtypes)

Printed output

   survived     sex   age  ...     deck  embark_town  alone
0         0    male  22.0  ...  unknown  Southampton      n
1         1  female  38.0  ...        C    Cherbourg      n
2         1  female  26.0  ...  unknown  Southampton      y
3         1  female  35.0  ...        C  Southampton      n
4         0    male  28.0  ...  unknown   Queenstown      y

survived                int64
sex                    object
age                   float64
n_siblings_spouses      int64
parch                   int64
fare                  float64
class                  object
deck                   object
embark_town            object
alone                  object

Some columns contain string data, so we convert them to discrete numeric codes.

df['sex'] = pd.Categorical(df['sex'])
df['sex'] = df.sex.cat.codes

Print it again

   survived  sex   age  n_siblings_spouses  ...  class     deck  embark_town alone
0         0    1  22.0                   1  ...  Third  unknown  Southampton     n
1         1    0  38.0                   1  ...  First        C    Cherbourg     n
2         1    0  26.0                   0  ...  Third  unknown  Southampton     y
3         1    0  35.0                   1  ...  First        C  Southampton     n
4         0    1  28.0                   0  ...  Third  unknown   Queenstown     y

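Sure enough, the sex column is now numeric. Convert the other columns the same way, and drop the class column, since class is an awkward name (it is a reserved word in Python).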

df['deck'] = pd.Categorical(df['deck'])
df['deck'] = df.deck.cat.codes

df['embark_town'] = pd.Categorical(df['embark_town'])
df['embark_town'] = df.embark_town.cat.codes

df['alone'] = pd.Categorical(df['alone'])
df['alone'] = df.alone.cat.codes
df.drop('class', axis=1, inplace=True)

print(df.head())
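What pd.Categorical followed by .cat.codes does to a string column can be imitated in plain Python. This is a sketch of the idea only, assuming the categories are the sorted unique values; `categorical_codes` is a name invented here.

```python
def categorical_codes(values):
    # Categories are the sorted unique values; each value maps to its position.
    categories = sorted(set(values))
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup[v] for v in values], categories

# The first five values of the sex column shown earlier.
codes, categories = categorical_codes(
    ['male', 'female', 'female', 'female', 'male'])
print(categories)  # ['female', 'male']
print(codes)       # [1, 0, 0, 0, 1]
```

The codes match the sex column in the printed DataFrame above. Note that the integers carry no meaningful order; they are only identifiers for the categories.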

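Read the data

Set the alone field as the value to predict. Whichever column is removed with df.pop() becomes the label, and it can only be popped once. The data then goes into tf.data.Dataset.from_tensor_slices, and take is used when printing.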

# Pop the 'alone' column so that it becomes the label, as described above
alone = df.pop('alone')
dataset = tf.data.Dataset.from_tensor_slices((df.values, alone.values))
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))

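Shuffle the data and set the batch size. PS: repeat and cache are not set here.
The batch size is 1, and the shuffle buffer size is the length of the whole dataset.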

train_dataset = dataset.shuffle(len(df)).batch(1)

Prepare the model and run it

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)

An alternative to feature columns

The data can also be passed to the model as a dictionary.
Each value in the dictionary is created with tf.keras.layers.Input(shape=(), name=key).

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

The effect is the same. A batch of the dictionary-style data looks like this:

({'survived': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1])>, 
'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0])>, 
'age': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([22., 38., 26., 35., 28.,  2., 27., 14.,  4., 20., 39., 14.,  2., 28., 31., 28.], dtype=float32)>, 
'n_siblings_spouses': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 0, 1, 0, 3, 0, 1, 1, 0, 1, 0, 4, 0, 1, 0])>, 
'parch': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 5, 0, 1, 0, 0, 0])>, 
'fare': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([ 7.25  , 71.2833,  7.925 , 53.1   ,  8.4583, 21.075 , 11.1333, 30.0708, 16.7   ,  8.05  , 31.275 ,  7.8542, 29.125 , 13.  , 18.  ,  7.225 ], dtype=float32)>,    
'deck': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([7, 2, 7, 2, 7, 7, 7, 7, 6, 7, 7, 7, 7, 7, 7, 7])>, 
'embark_town': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 0])>}, <tf.Tensor: shape=(16,), dtype=int8, numpy=array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1], dtype=int8)>)

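TFRecord and tf.Example

Official TFRecord and tf.Example tutorial

To read data efficiently, it can be helpful to serialize the data and store it in a set of files (100-200 MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. It can also be useful for caching any data preprocessing.

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.

The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used in higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.Example message, and how to serialize, write, and read tf.Example messages to and from .tfrecord files.

Note: while useful, these structures are optional. There is no need to convert existing code to use TFRecord unless you are using tf.data and reading data is still the bottleneck in training. See Data Input Pipeline Performance for dataset performance tips.

My understanding is that this is like downloading a file from the internet as a stream, bit by bit, so that an interrupted transfer can resume without starting from scratch. The same goes for the data.

This is just for familiarity for now.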
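To illustrate what "a sequence of binary records" that can be read linearly means, here is a deliberately simplified length-prefixed record format in plain Python. It is not the real TFRecord layout (which, as far as I know, also stores checksums for the length and the data); `write_records` and `read_records` are names made up for this sketch.

```python
import io
import struct

def write_records(stream, records):
    # Each record: an 8-byte little-endian length, then the raw bytes.
    for record in records:
        stream.write(struct.pack('<Q', len(record)))
        stream.write(record)

def read_records(stream):
    # Read records back one after another until the stream ends.
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack('<Q', header)
        yield stream.read(length)

buffer = io.BytesIO()  # stands in for a file on disk
write_records(buffer, [b'first', b'second record', b''])
buffer.seek(0)
print(list(read_records(buffer)))  # [b'first', b'second record', b'']
```

Because every record announces its own length, a reader can walk through the file front to back without an index, which is what makes this kind of file convenient to stream.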

Text data

Text data tutorial: a look at how tf.data.TextLineDataset is used.

Each line of the source file is one sample, which suits most line-based text data.

Imports

import tensorflow as tf

import tensorflow_datasets as tfds
import os

Data preparation

Download the prepared data.

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)

Turn the data into a tf.data.TextLineDataset

Read the data with tf.data.TextLineDataset, then use tf.data.Dataset.map to attach a label to each line.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)  

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)
# Merge the three datasets into one, then shuffle it
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

Take a look at what the data looks like

for ex in all_labeled_data.take(5):
    print(ex)

The numpy field is the value of each Tensor.

(<tf.Tensor: shape=(), dtype=string, numpy=b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'His wrath pernicious, who ten thousand woes'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Caused to Achaia's host, sent many a soul">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Illustrious into Ades premature,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'And Heroes gave (so stood the will of Jove)'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

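Convert the text to numbers

Now that we have text, convert it to numbers so the computer can work with it; my blind guess is one-hot or vector embeddings.
The official example uses tfds.features.text.Tokenizer(), which has to be changed to tfds.deprecated.text.Tokenizer(). This library changes so quickly that even the official examples cannot keep up.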

# Build the vocabulary
tokenizer = tfds.deprecated.text.Tokenizer()
# Iterate over all the text and keep the unique tokens; a set does this neatly
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)
# Total size of the vocabulary
vocab_size = len(vocabulary_set)
print(vocab_size)  # 17178

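The official example's tfds.features.text.TokenTextEncoder has to be changed to tfds.deprecated.text.TokenTextEncoder, which is then used to encode the samples. PS: there is presumably a better function by now, otherwise this one would not have been deprecated.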

encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set)
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)  # b'With slanted shields, the Greeks; yet Hector still'
encoded_example = encoder.encode(example_text)
print(encoded_example)  # [8132, 15145, 4866, 10461, 7732, 465, 17108, 13725]
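The two steps above, building a vocabulary and turning a line into token ids, can be imitated in plain Python. This is only a sketch of the idea: the regular-expression tokenizer is a simplification, the sample lines are made up, and reserving id 0 for padding is an assumption chosen to match the padding step later on.

```python
import re

def tokenize(text):
    # Split on anything that is not a letter or digit (a simplification).
    return [t for t in re.split(r'[^A-Za-z0-9]+', text) if t]

lines = ["Achilles sing, O Goddess!", "sing, O Muse"]  # made-up sample lines

vocabulary = set()
for line in lines:
    vocabulary.update(tokenize(line))

# Sort for reproducible ids; id 0 is kept free for padding, so ids start at 1.
token_to_id = {token: i + 1 for i, token in enumerate(sorted(vocabulary))}

def encode_line(text):
    # Replace each token with its integer id.
    return [token_to_id[t] for t in tokenize(text)]

print(len(vocabulary))                 # 5
print(encode_line("O Goddess! sing"))  # [4, 2, 5]
```

Unlike this sketch, a real encoder also needs a rule for words that are not in the vocabulary; here an unknown word would simply raise a KeyError.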

TensorFlow's philosophy is to process data only when it is needed. When I write PyTorch, I tend to prepare all the data thoroughly before using it.

def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

def encode_map_fn(text, label):
  # py_func doesn't set the shape of the returned tensors.
  encoded_text, label = tf.py_function(encode, 
                                       inp=[text, label], 
                                       Tout=(tf.int64, tf.int64))

  # `tf.data.Datasets` work best if all components have a shape set
  #  so set the shapes manually: 
  encoded_text.set_shape([None])
  label.set_shape([])

  return encoded_text, label


all_encoded_data = all_labeled_data.map(encode_map_fn)

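Split into training and test sets

tf.data.Dataset.take and tf.data.Dataset.skip can be thought of as taking the first N elements of a tf.data.Dataset and as skipping the first N elements and keeping the rest, respectively. tf.data.Dataset.padded_batch is convenient here: it pads automatically, so there is no need to write padding code by hand. Unlike images, text samples vary in length and cannot be stretched, so they can only be zero-padded at the end to the same size.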

train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)
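The take/skip split and the per-batch padding can be sketched in plain Python. The token ids below are made up, and `padded_batches` is a helper written for this illustration rather than a TensorFlow function.

```python
from itertools import islice

def padded_batches(sequences, batch_size, pad_value=0):
    # Group sequences into batches and pad each batch to its own longest length.
    sequences = list(sequences)
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]

encoded = [[5, 2], [7], [1, 2, 3], [4, 4, 4, 4], [9]]  # made-up token ids
take_size = 2

test_part = list(islice(encoded, take_size))         # like take(2)
train_part = list(islice(encoded, take_size, None))  # like skip(2)

print(list(padded_batches(test_part, batch_size=2)))
# [[[5, 2], [7, 0]]]
print(list(padded_batches(train_part, batch_size=2)))
# [[[1, 2, 3, 0], [4, 4, 4, 4]], [[9]]]
```

Each batch is padded only to the length of its own longest sequence, so different batches can have different widths.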

Take a look at the data

# This is one batch
sample_text, sample_labels = next(iter(test_data))
# Look at the first element of the batch
print(sample_text[0], sample_labels[0])

Result: you can see that the end is padded with zeros

(<tf.Tensor: shape=(15,), dtype=int64, numpy=
 array([ 8132, 15145,  4866, 10461,  7732,   465, 17108, 13725,     0,
            0,     0,     0,     0,     0,     0], dtype=int64)>,
 <tf.Tensor: shape=(), dtype=int64, numpy=1>)

Zero is also a token, so the vocabulary size has to be increased by one

vocab_size += 1

Model preparation

Sure enough, vector embeddings are used, followed by an LSTM

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 64))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

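# One or more densely connected layers
# Edit the list in the `for` line to experiment with layer sizes
for units in [64, 64]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels; the activation is softmax.
model.add(tf.keras.layers.Dense(3, activation='softmax'))
# With softmax outputs and integer labels, sparse_categorical_crossentropy is the usual loss.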
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
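For a single sample, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true class. The plain-Python sketch below uses made-up probabilities for the three classes; the function name mirrors the Keras loss only for readability.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    # Negative log of the probability given to the true class (an integer label).
    return -math.log(probs[label])

probs = [0.7, 0.2, 0.1]  # made-up softmax output for 3 classes
print(sparse_categorical_crossentropy(probs, 0))  # about 0.357
print(sparse_categorical_crossentropy(probs, 2))  # about 2.303

# A model that guesses uniformly over 3 classes scores about 1.099.
print(sparse_categorical_crossentropy([1/3, 1/3, 1/3], 1))
```

"Sparse" refers to the labels being plain integers (0, 1, 2) rather than one-hot vectors, which is the form the labeler function produces.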

Run it

model.fit(train_data, epochs=3, validation_data=test_data)

eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))

Result

Eval loss: 0.38248714804649353, Eval accuracy: 0.8285999894142151

Summary

TensorFlow is really powerful, but only once you are familiar with its APIs.
