Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition) — Chapter 13: Loading and Preprocessing Data with TensorFlow

So far we have only used datasets that fit in memory, but deep learning often relies on datasets far too large to fit in RAM. Handling such datasets can be tricky with other deep learning libraries, but TensorFlow makes it easy thanks to its Data API: you simply create a dataset object, tell it where to get the data and how to transform it, and TensorFlow takes care of the implementation details such as multithreading, queuing, batching, and so on. The Data API also works seamlessly with tf.keras.

The Data API can read text files and binary files. TFRecord is an easy-to-use, efficient format based on Protocol Buffers. The Data API also supports reading from SQL databases.

After loading the data you usually need to preprocess it, for example by normalizing it. You can write custom preprocessing layers or use the standard preprocessing layers provided by Keras.

TF Transform (tf.Transform): lets you write a preprocessing function that is run in batch mode on the full training set, then converted to a TF Function and embedded in the model, so that once the model is deployed to production it can preprocess new samples quickly.

TF Datasets (TFDS): provides convenient functions for downloading many common datasets, including large ones such as ImageNet.

0. Importing the Required Libraries

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import pandas as pd
import os

import tensorflow_datasets as tfds
import tensorflow_hub as hub
import tensorflow_transform as tft


for i in (tf, np, mpl, pd, tfds, hub, tft):
    print("{}: {}".format(i.__name__, i.__version__))

Output:

tensorflow: 2.2.0
numpy: 1.17.4
matplotlib: 3.1.2
pandas: 0.25.3
tensorflow_datasets: 3.1.0
tensorflow_hub: 0.8.0
tensorflow_transform: 0.22.0

1. The Data API

tf.data.Dataset.from_tensor_slices() creates a dataset whose elements are held in memory:

X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

Output:

<TensorSliceDataset shapes: (), types: tf.int32>

from_tensor_slices() takes a tensor and returns a tf.data.Dataset whose elements are all the slices of that tensor along the first dimension, so this dataset is equivalent to the following (apart from the dtype: int32 above versus int64 below):

dataset = tf.data.Dataset.range(10)

for item in dataset:
    print(item)

Output:

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)

1.1 Chaining Transformations

In the example below, the repeat() method first repeats the dataset three times, returning a new dataset; then batch() is called on that dataset to group the items into batches of seven, which also returns a new dataset.

dataset = dataset.repeat(3).batch(7)

for item in dataset:
    print(item)

Output:

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)

You can set drop_remainder=True to drop the final batch when it contains fewer items than the batch size:

dataset = tf.data.Dataset.range(10)
dataset = dataset.repeat(3).batch(7, drop_remainder=True)

for item in dataset:
    print(item)

Output:

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)

Note: dataset methods do not modify the dataset they are called on; each method returns a new dataset.
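As a minimal sketch of this behavior (the variable names are just for illustration):

base = tf.data.Dataset.range(5)
batched = base.batch(2)  # returns a new dataset; base itself is unchanged
for item in base:        # still yields the original scalar items 0, 1, 2, 3, 4
    print(item.numpy())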

You can also apply transformations by calling the map() method:

dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2)
dataset = dataset.repeat(3).batch(7, drop_remainder=True)

for item in dataset:
    print(item)

Output:

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int64)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int64)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int64)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int64)

Note: for computationally intensive transformations you can spawn multiple threads by setting the num_parallel_calls argument. The function passed to map() must be convertible to a TF Function.
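As a minimal sketch of parallel mapping (the thread count of 4 is arbitrary):

dataset = tf.data.Dataset.range(10)
# run the lambda on up to 4 elements in parallel; the function must be convertible to a TF Function
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=4)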

While map() applies a function to each item, the apply() method applies a function to the dataset as a whole:

dataset = dataset.apply(tf.data.experimental.unbatch())
dataset = dataset.filter(lambda x: x < 10)  # keep only items < 10
for item in dataset.take(5):
    print(item)

Output:

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)

1.2 Shuffling the Data

Gradient Descent works best when the instances in the training set are independent and identically distributed. The simplest way to ensure this is to shuffle them with the shuffle() method. It creates a new dataset that keeps a buffer of the specified size: each time an item is requested, one item is pulled out of the buffer at random and replaced with a fresh item from the source dataset, until both the source dataset and the buffer are exhausted. Make the buffer reasonably large, otherwise shuffling will not be very effective, but do not exceed the available RAM.

dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)

for item in dataset:
    print(item)

Output:

tf.Tensor([0 3 4 2 1 5 8], shape=(7,), dtype=int64)
tf.Tensor([6 9 7 2 3 1 4], shape=(7,), dtype=int64)
tf.Tensor([6 0 7 9 0 1 2], shape=(7,), dtype=int64)
tf.Tensor([8 4 5 5 3 8 9], shape=(7,), dtype=int64)
tf.Tensor([7 6], shape=(2,), dtype=int64)

As the output above shows, the dataset contains the numbers 0 to 9 repeated three times, shuffled with a buffer size of 3 and grouped into batches of 7.

When you call repeat() on a shuffled dataset, a new order is generated at every iteration. If you need the same order on every pass (for example when testing or debugging), set reshuffle_each_iteration=False.
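A minimal sketch of this option (the buffer size and seed are arbitrary):

dataset = tf.data.Dataset.range(10)
# with reshuffle_each_iteration=False, every pass over the repeated dataset uses the same shuffled order
dataset = dataset.shuffle(buffer_size=5, seed=42, reshuffle_each_iteration=False).repeat(2)
for item in dataset:
    print(item.numpy(), end=" ")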

For a large dataset that does not fit in memory, this buffer-based shuffling may not be enough, since the buffer is small compared to the dataset. One solution is to shuffle the source data itself, for example by splitting it into multiple files and reading those files in a random order during training. However, instances located in the same file still keep their fixed order, so you can also read from multiple files at the same time, interleaving their records (for example reading one line from each file in turn). All of this can be done very simply with the Data API.

1.3 Interleaving Lines from Multiple Files

Load the California housing dataset, shuffle it, and split it into a training set, a validation set, and a test set. Then split each set into multiple CSV files:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing() # load the dataset
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target.reshape(-1, 1), 
                                                              random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)

X_mean = scaler.mean_
X_std = scaler.scale_
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

pd.read_csv(train_filepaths[0]).head()

Output: (the first five rows of the first training CSV, shown as a DataFrame)

with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end="")

Output:

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621
train_filepaths

Output:

['datasets\\housing\\my_train_00.csv',
 'datasets\\housing\\my_train_01.csv',
 'datasets\\housing\\my_train_02.csv',
 'datasets\\housing\\my_train_03.csv',
 'datasets\\housing\\my_train_04.csv',
 'datasets\\housing\\my_train_05.csv',
 'datasets\\housing\\my_train_06.csv',
 'datasets\\housing\\my_train_07.csv',
 'datasets\\housing\\my_train_08.csv',
 'datasets\\housing\\my_train_09.csv',
 'datasets\\housing\\my_train_10.csv',
 'datasets\\housing\\my_train_11.csv',
 'datasets\\housing\\my_train_12.csv',
 'datasets\\housing\\my_train_13.csv',
 'datasets\\housing\\my_train_14.csv',
 'datasets\\housing\\my_train_15.csv',
 'datasets\\housing\\my_train_16.csv',
 'datasets\\housing\\my_train_17.csv',
 'datasets\\housing\\my_train_18.csv',
 'datasets\\housing\\my_train_19.csv']

list_files() returns a dataset of file paths, shuffled by default; you can set shuffle=False if you do not want them shuffled:

filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

for filepath in filepath_dataset:
    print(filepath)

Output:

tf.Tensor(b'datasets\\housing\\my_train_05.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_16.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_01.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_17.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_00.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_14.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_10.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_02.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_12.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_19.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_07.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_09.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_13.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_15.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_11.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_18.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_04.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_06.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_03.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_08.csv', shape=(), dtype=string)

Use the interleave() method to read lines from five files at a time, interleaving them; skip() skips the header row of each file:

n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1), cycle_length=n_readers) # skip the first line (the header)

for line in dataset.take(5):
    print(line.numpy())

Output:

b'4.5909,16.0,5.475877192982456,1.0964912280701755,1357.0,2.9758771929824563,33.63,-117.71,2.418'
b'2.4792,24.0,3.4547038327526134,1.1341463414634145,2251.0,3.921602787456446,34.18,-118.38,2.0'
b'4.2708,45.0,5.121387283236994,0.953757225433526,492.0,2.8439306358381504,37.48,-122.19,2.67'
b'2.1856,41.0,3.7189873417721517,1.0658227848101265,803.0,2.0329113924050635,32.76,-117.12,1.205'
b'4.1812,52.0,5.701388888888889,0.9965277777777778,692.0,2.4027777777777777,33.73,-118.31,3.215'

Note: for interleaving to work well, the files should ideally all have the same length (number of lines); otherwise the trailing lines of the longer files will not be interleaved with the others.

By default, interleave() does not use parallelism; it reads one line at a time from each file, sequentially. If you want it to read files in parallel, set the num_parallel_calls argument to the desired number of threads, or to tf.data.experimental.AUTOTUNE to let TensorFlow choose the number of threads dynamically based on the available CPU.
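For example, a minimal sketch of parallel interleaving, reusing filepath_dataset and n_readers from above:

dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers,
    # let TensorFlow choose the number of reader threads based on available CPU
    num_parallel_calls=tf.data.experimental.AUTOTUNE)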

1.4 Preprocessing the Data

#X_mean, X_std = [...] # mean and scale of each feature in the training set
n_inputs = 8 # X_train.shape[-1]

@tf.function
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

This assumes that the mean and standard deviation of each feature have already been computed on the training set (here they were obtained earlier with Scikit-Learn's StandardScaler).

tf.io.decode_csv() takes the line to parse as its first argument and, as its second argument (record_defaults), an array containing the default value for each column.

decode_csv() returns a list of scalar tensors (one per column); tf.stack() stacks them into a 1D tensor.

Finally, the input features are scaled by subtracting the mean and dividing by the standard deviation.

preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')

Output:

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 0.16579157,  1.216324  , -0.05204565, -0.39215982, -0.5277444 ,
        -0.2633488 ,  0.8543046 , -1.3072058 ], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)
record_defaults=[0, np.nan, tf.constant(np.nan, dtype=tf.float64), "Hello", tf.constant([])]
parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults=record_defaults)
parsed_fields

Output:

[<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.0>,
 <tf.Tensor: shape=(), dtype=float64, numpy=3.0>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'4'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

Any missing values are replaced by their default values:

parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields

Output:

[<tf.Tensor: shape=(), dtype=int32, numpy=0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=nan>,
 <tf.Tensor: shape=(), dtype=float64, numpy=nan>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'Hello'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

The default value for the fifth field is tf.constant([]), which marks it as required; leaving it empty raises an error:

try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Output:

Field 4 is required but missing in record 0! [Op:DecodeCSV]

The number of fields must also match, otherwise an error is raised:

try:
    parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Output:

Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]

1.5 Putting Everything Together

All of the loading and preprocessing steps can be combined into a single reusable function:

def csv_reader_dataset(filepaths, repeat=1, n_readers=5, n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
                                 cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in train_set.take(2):
    print("X =", X_batch)
    print("y =", y_batch)
    print()

Output:

X = tf.Tensor(
[[ 1.0026022  -0.2867314   0.01174602 -0.0658901  -0.38811532  0.07317533
   0.8215112  -1.2472363 ]
 [-0.74581194 -0.8404887  -0.21125445 -0.02732265  3.6885073  -0.20515272
   0.5404227  -0.07777973]
 [-0.67934674 -0.44494775 -0.76743394 -0.14639002 -0.05045014  0.268618
  -0.5745582  -0.0427962 ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[3.087]
 [0.743]
 [2.326]], shape=(3, 1), dtype=float32)

X = tf.Tensor(
[[ 0.6130298  -1.7106786   4.2995334   2.9747813  -0.859934    0.0307362
   1.7350441  -0.3026771 ]
 [ 2.1392124   1.8491895   0.88371885  0.11082522 -0.5313949  -0.3833385
   1.0089018  -1.4271526 ]
 [ 1.200269   -0.998705    1.1007434  -0.15711978  0.43597025  0.17005198
  -1.1976345   1.2715893 ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[1.741  ]
 [5.00001]
 [3.356  ]], shape=(3, 1), dtype=float32)

1.6 Prefetching

The function above ends with a call to prefetch(1): while the model is training on one batch, the dataset works in parallel on getting the next batch ready (reading it from disk and preprocessing it). This can significantly improve performance.
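As a minimal sketch, prefetching can also be tuned automatically with tf.data.experimental.AUTOTUNE instead of a fixed number of batches (this variant is an assumption on my part, not the function used above):

dataset = tf.data.Dataset.list_files(train_filepaths)
dataset = dataset.interleave(lambda fp: tf.data.TextLineDataset(fp).skip(1), cycle_length=5)
dataset = dataset.map(preprocess).batch(32)
# prepare as many upcoming batches as TensorFlow deems appropriate while the model trains
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)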

1.7 Using the Dataset with tf.keras

Now we can use the csv_reader_dataset() function defined above to load the data and train a model:

train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1),
])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))

batch_size = 32
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10, validation_data=valid_set)

Output:

Epoch 1/10
362/362 [==============================] - 1s 3ms/step - loss: 1.6426 - val_loss: 1.1271
Epoch 2/10
362/362 [==============================] - 1s 2ms/step - loss: 0.8001 - val_loss: 0.7965
Epoch 3/10
362/362 [==============================] - 1s 2ms/step - loss: 0.7323 - val_loss: 0.7688
Epoch 4/10
362/362 [==============================] - 1s 2ms/step - loss: 0.6579 - val_loss: 0.7544
Epoch 5/10
362/362 [==============================] - 1s 2ms/step - loss: 0.6361 - val_loss: 0.6197
Epoch 6/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5925 - val_loss: 0.6548
Epoch 7/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5661 - val_loss: 0.5478
Epoch 8/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5527 - val_loss: 0.5200
Epoch 9/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5298 - val_loss: 0.5015
Epoch 10/10
362/362 [==============================] - 1s 2ms/step - loss: 0.4920 - val_loss: 0.5561

<tensorflow.python.keras.callbacks.History at 0x43a2a358>
model.evaluate(test_set, steps=len(X_test) // batch_size)

Output:

161/161 [==============================] - 0s 1ms/step - loss: 0.4933

0.49326828122138977
new_set = test_set.map(lambda X, y: X) # we could instead just pass test_set, Keras would ignore the labels
X_new = X_test
model.predict(new_set, steps=len(X_new) // batch_size)

Output:

array([[1.9393051 ],
       [0.83419716],
       [1.8839278 ],
       ...,
       [1.8963358 ],
       [3.858193  ],
       [1.2957773 ]], dtype=float32)

Custom training loop:

optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error

n_epochs = 5
batch_size = 32
n_steps_per_epoch = len(X_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0
for X_batch, y_batch in train_set.take(total_steps):
    global_step += 1
    print("\rGlobal step {}/{}".format(global_step, total_steps), end="")
    with tf.GradientTape() as tape:
        y_pred = model(X_batch)
        main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
        loss = tf.add_n([main_loss] + model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Output:

Global step 1810/1810

Creating a TF Function that performs the whole training loop:

optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error

@tf.function
def train(model, n_epochs, batch_size=32,
          n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
    train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
                       n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
                       n_parse_threads=n_parse_threads, batch_size=batch_size)
    for X_batch, y_batch in train_set:
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

train(model, 5)

The same TF Function for the training loop, this time printing progress every 100 steps:

optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error

@tf.function
def train(model, n_epochs, batch_size=32,
          n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
    train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
                       n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
                       n_parse_threads=n_parse_threads, batch_size=batch_size)
    n_steps_per_epoch = len(X_train) // batch_size
    total_steps = n_epochs * n_steps_per_epoch
    global_step = 0
    for X_batch, y_batch in train_set.take(total_steps):
        global_step += 1
        if tf.equal(global_step % 100, 0):
            tf.print("\rGlobal step", global_step, "/", total_steps)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

train(model, 5)

A quick overview of the Dataset class methods:

for m in dir(tf.data.Dataset):
    if not (m.startswith("_") or m.endswith("_")):
        func = getattr(tf.data.Dataset, m)
        if hasattr(func, "__doc__"):
            print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))

Output:

● apply()              Applies a transformation function to this dataset.
● as_numpy_iterator()  Returns an iterator which converts all elements of the dataset to numpy.
● batch()              Combines consecutive elements of this dataset into batches.
● cache()              Caches the elements in this dataset.
● concatenate()        Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec()       The type specification of an element of this dataset.
● enumerate()          Enumerates the elements of this dataset.
● filter()             Filters this dataset according to `predicate`.
● flat_map()           Maps `map_func` across this dataset and flattens the result.
● from_generator()     Creates a `Dataset` whose elements are generated by `generator`.
● from_tensor_slices() Creates a `Dataset` whose elements are slices of the given tensors.
● from_tensors()       Creates a `Dataset` with a single element, comprising the given tensors.
● interleave()         Maps `map_func` across this dataset, and interleaves the results.
● list_files()         A dataset of all files matching one or more glob patterns.
● map()                Maps `map_func` across the elements of this dataset.
● options()            Returns the options for this dataset and its inputs.
● padded_batch()       Combines consecutive elements of this dataset into padded batches.
● prefetch()           Creates a `Dataset` that prefetches elements from this dataset.
● range()              Creates a `Dataset` of a step-separated range of values.
● reduce()             Reduces the input dataset to a single element.
● repeat()             Repeats this dataset so each original value is seen `count` times.
● shard()              Creates a `Dataset` that includes only 1/`num_shards` of this dataset.
● shuffle()            Randomly shuffles the elements of this dataset.
● skip()               Creates a `Dataset` that skips `count` elements from this dataset.
● take()               Creates a `Dataset` with at most `count` elements from this dataset.
● unbatch()            Splits elements of a dataset into multiple elements.
● window()             Combines (nests of) input elements into a dataset of (nests of) windows.
● with_options()       Returns a new `tf.data.Dataset` with the given options set.
● zip()                Creates a `Dataset` by zipping together the given datasets.

2. The TFRecord Format

The TFRecord format is TensorFlow's preferred format for storing and efficiently reading large amounts of data. It is a simple binary format containing a sequence of binary records of varying sizes.

Each record is composed of: a length, a CRC checksum of the length, the data itself, and a CRC checksum of the data.

2.1 Reading and Writing TFRecord Files

You can create a TFRecord file with the tf.io.TFRecordWriter class:

with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

And you can read it with a tf.data.TFRecordDataset:

filepaths = ["my_data.tfrecord"] # 可以读取一个或多个文件

dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

Output:

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)

By default, a TFRecordDataset reads the files one by one. If you set the num_parallel_reads argument, it reads multiple files in parallel and interleaves their records, achieving the same effect as the list_files() and interleave() approach shown earlier:

filepaths = ["my_test_{}.tfrecord".format(i) for i in range(5)]

for i, filepath in enumerate(filepaths):
    with tf.io.TFRecordWriter(filepath) as f:
        for j in range(3):
            f.write("File {} record {}".format(i, j).encode("utf-8"))

dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)
for item in dataset:
    print(item)

Output:

tf.Tensor(b'File 0 record 0', shape=(), dtype=string)
tf.Tensor(b'File 1 record 0', shape=(), dtype=string)
tf.Tensor(b'File 2 record 0', shape=(), dtype=string)
tf.Tensor(b'File 0 record 1', shape=(), dtype=string)
tf.Tensor(b'File 1 record 1', shape=(), dtype=string)
tf.Tensor(b'File 2 record 1', shape=(), dtype=string)
tf.Tensor(b'File 0 record 2', shape=(), dtype=string)
tf.Tensor(b'File 1 record 2', shape=(), dtype=string)
tf.Tensor(b'File 2 record 2', shape=(), dtype=string)
tf.Tensor(b'File 3 record 0', shape=(), dtype=string)
tf.Tensor(b'File 4 record 0', shape=(), dtype=string)
tf.Tensor(b'File 3 record 1', shape=(), dtype=string)
tf.Tensor(b'File 4 record 1', shape=(), dtype=string)
tf.Tensor(b'File 3 record 2', shape=(), dtype=string)
tf.Tensor(b'File 4 record 2', shape=(), dtype=string)

2.2 Compressed TFRecord Files

options = tf.io.TFRecordOptions(compression_type="GZIP")

with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")
for item in dataset:
    print(item)

Output:

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)

2.3 A Brief Introduction to Protocol Buffers

TFRecord files usually contain serialized protocol buffers (also called protobufs), a portable, extensible, and efficient binary serialization format developed by Google in 2001 and open-sourced in 2008.

%%writefile person.proto
syntax = "proto3";
message Person {
  string name = 1;
  int32 id = 2;
  repeated string email = 3;
}

Output:

Overwriting person.proto

This definition says that we are using version 3 of the protobuf format, and that each Person object may have a name of type string, an id of type int32, and zero or more email fields, each of type string. The numbers 1, 2, and 3 are the field identifiers.

You can compile the .proto file with protoc, the protobuf compiler, to generate classes that can be used in Python.

The compilation can be run on the command line as follows:

!protoc-3.12.3-win64\bin\protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports
!dir person*

Compilation produces a person_pb2.py file, from which the Person class can be imported:

from person_pb2 import Person

person = Person(name="Al", id=123, email=["a@b.com"])  # create a Person
print(person)  # display the Person

Output:

name: "Al"
id: 123
email: "a@b.com"
person.name  # read a field

Output:

'Al'
person.name = "Alice"  # modify a field

person.name

Output:

'Alice'
person.email[0]  # repeated fields can be accessed like arrays

Output:

'a@b.com'
person.email.append("c@d.com")  # add an email address

person.email

Output:

['a@b.com', 'c@d.com']
s = person.SerializeToString()  # serialize to a byte string
s

Output:

b'\n\x05Alice\x10{\x1a\x07a@b.com\x1a\x07c@d.com'
person2 = Person()  # create a new Person
person2.ParseFromString(s)  # parse the byte string (27 bytes)

Output:

27
person == person2  # now they are equal

Output:

True

Custom protobufs can also be parsed inside TensorFlow using tf.io.decode_proto():

person_tf = tf.io.decode_proto(bytes=s, message_type="Person", field_names=["name", "id", "email"],
                               output_types=[tf.string,tf.int32,tf.string],descriptor_source="person.desc")

person_tf.values

Output:

[<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Alice'], dtype=object)>,
 <tf.Tensor: shape=(1,), dtype=int32, numpy=array([123])>,
 <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>]

2.4 TensorFlow Protobufs

The main protobufs used by TensorFlow are defined as follows (the corresponding Python classes live in tf.train):

syntax = "proto3";

message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example

person_example = Example(features=Features(feature={"name": Feature(bytes_list=BytesList(value=[b"Alice"])),
                                                    "id": Feature(int64_list=Int64List(value=[123])),
                                                    "emails": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
        }))

with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

2.5 Loading and Parsing Examples

Serialized Example protobufs can be loaded with tf.data.TFRecordDataset and parsed using a feature description:

feature_description = {"name": tf.io.FixedLenFeature([], tf.string, default_value=""),
                       "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
                       "emails": tf.io.VarLenFeature(tf.string)}

for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example, feature_description)
    
parsed_example

Output:

{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x3ec19630>,
 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>,
 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'Alice'>}

Fixed-length features are parsed as regular (dense) tensors, while variable-length features are parsed as sparse tensors. A sparse tensor can be converted to a dense tensor with tf.sparse.to_dense():

parsed_example["emails"].values[0]

Output:

<tf.Tensor: shape=(), dtype=string, numpy=b'a@b.com'>
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")

Output:

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>
parsed_example["emails"].values

Output:

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

A BytesList can contain any binary data you want. For example, you can encode an image to JPEG with tf.io.encode_jpeg() and store the resulting bytes in a BytesList:

from sklearn.datasets import load_sample_images

img = load_sample_images()["images"][0]
plt.imshow(img)
plt.axis("off")
plt.title("Original Image")

plt.show()

Output: (the original image is displayed)

data = tf.io.encode_jpeg(img) # encode the image to JPEG

example_with_image = Example(features=Features(feature={
    "image": Feature(bytes_list=BytesList(value=[data.numpy()]))}))
serialized_example = example_with_image.SerializeToString()
# then save to TFRecord

When reading it back from the TFRecord, decode the bytes with tf.io.decode_jpeg(), or with tf.io.decode_image() (which can decode BMP, GIF, JPEG, and PNG images), to recover the original image:

feature_description = { "image": tf.io.VarLenFeature(tf.string) }

example_with_image = tf.io.parse_single_example(serialized_example, feature_description) # parse the serialized Example

decoded_img = tf.io.decode_jpeg(example_with_image["image"].values[0]) # decode back to the original image
#decoded_img = tf.io.decode_image(example_with_image["image"].values[0]) # or use decode_image() instead

plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")

plt.show()

Output: (the decoded image is displayed)

You can also serialize an arbitrary tensor to a byte string with tf.io.serialize_tensor() and parse it back with tf.io.parse_tensor():

t = tf.constant([[0., 1.], [2., 3.], [4., 5.]])
s = tf.io.serialize_tensor(t)
s

Output:

<tf.Tensor: shape=(), dtype=string, numpy=b'\x08\x01\x12\x08\x12\x02\x08\x03\x12\x02\x08\x02"\x18\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@\x00\x00@@\x00\x00\x80@\x00\x00\xa0@'>
tf.io.parse_tensor(s, out_type=tf.float32)

Output:

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[0., 1.],
       [2., 3.],
       [4., 5.]], dtype=float32)>
serialized_sparse = tf.io.serialize_sparse(parsed_example["emails"])
serialized_sparse

Output:

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'\x08\t\x12\x08\x12\x02\x08\x02\x12\x02\x08\x01"\x10\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00',
       b'\x08\x07\x12\x04\x12\x02\x08\x02"\x10\x07\x07a@b.comc@d.com',
       b'\x08\t\x12\x04\x12\x02\x08\x01"\x08\x02\x00\x00\x00\x00\x00\x00\x00'],
      dtype=object)>
BytesList(value=serialized_sparse.numpy())

Output:

value: "\010\t\022\010\022\002\010\002\022\002\010\001\"\020\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000"
value: "\010\007\022\004\022\002\010\002\"\020\007\007a@b.comc@d.com"
value: "\010\t\022\004\022\002\010\001\"\010\002\000\000\000\000\000\000\000"
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(serialized_examples, feature_description)
    
parsed_examples

Output:

{'image': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x4c367c18>}

2.6 Handling Lists of Lists Using the SequenceExample Protobuf

The Example protobuf covers most use cases, but it is somewhat awkward for data consisting of lists of lists. TensorFlow provides the SequenceExample protobuf specifically for this kind of data.

syntax = "proto3";

message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
  Features context = 1;
  FeatureLists feature_lists = 2;
};

A SequenceExample contains a Features object for the contextual data and a FeatureLists object holding one or more named FeatureList objects. Each FeatureList contains a list of Feature objects, and each Feature may be a list of byte strings, of 64-bit integers, or of floats.

Building a SequenceExample is similar to building an Example:

FeatureList = tf.train.FeatureList
FeatureLists = tf.train.FeatureLists
SequenceExample = tf.train.SequenceExample

context = Features(feature={"author_id": Feature(int64_list=Int64List(value=[123])),
                            "title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
                            "pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25]))
                           })

content = [["When", "shall", "we", "three", "meet", "again", "?"],
           ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
            ["When", "the", "battle", "'s", "lost", "and", "won", "."]]

def words_to_feature(words):
    return Feature(bytes_list=BytesList(value=[word.encode("utf-8") for word in words]))

content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]
            
sequence_example = SequenceExample(context=context,
                                   feature_lists=FeatureLists(feature_list={
                                       "content": FeatureList(feature=content_features),
                                       "comments": FeatureList(feature=comments_features)
                                   }))

sequence_example

Output:

context {
  feature {
    key: "author_id"
    value {
      int64_list {
        value: 123
      }
    }
  }
  feature {
    key: "pub_date"
    value {
      int64_list {
        value: 1623
        value: 12
        value: 25
      }
    }
  }
  feature {
    key: "title"
    value {
      bytes_list {
        value: "A"
        value: "desert"
        value: "place"
        value: "."
      }
    }
  }
}
feature_lists {
  feature_list {
    key: "comments"
    value {
      feature {
        bytes_list {
          value: "When"
          value: "the"
          value: "hurlyburly"
          value: "\'s"
          value: "done"
          value: "."
        }
      }
      feature {
        bytes_list {
          value: "When"
          value: "the"
          value: "battle"
          value: "\'s"
          value: "lost"
          value: "and"
          value: "won"
          value: "."
        }
      }
    }
  }
  feature_list {
    key: "content"
    value {
      feature {
        bytes_list {
          value: "When"
          value: "shall"
          value: "we"
          value: "three"
          value: "meet"
          value: "again"
          value: "?"
        }
      }
      feature {
        bytes_list {
          value: "In"
          value: "thunder"
          value: ","
          value: "lightning"
          value: ","
          value: "or"
          value: "in"
          value: "rain"
          value: "?"
        }
      }
    }
  }
}
serialized_sequence_example = sequence_example.SerializeToString()

context_feature_descriptions = {"author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
                                "title": tf.io.VarLenFeature(tf.string),
                                "pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),
                               }
sequence_feature_descriptions = {"content": tf.io.VarLenFeature(tf.string),
                                 "comments": tf.io.VarLenFeature(tf.string),
                                }
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions, sequence_feature_descriptions)

parsed_context

Output:

{'title': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x4c355438>,
 'author_id': <tf.Tensor: shape=(), dtype=int64, numpy=123>,
 'pub_date': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1623,   12,   25], dtype=int64)>}
parsed_context["title"].values

Output:

<tf.Tensor: shape=(4,), dtype=string, numpy=array([b'A', b'desert', b'place', b'.'], dtype=object)>
parsed_feature_lists

Output:

{'comments': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x4c355fd0>,
 'content': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x4c3558d0>}
print(tf.RaggedTensor.from_sparse(parsed_feature_lists["content"]))

Output:

<tf.RaggedTensor [[b'When', b'shall', b'we', b'three', b'meet', b'again', b'?'], [b'In', b'thunder', b',', b'lightning', b',', b'or', b'in', b'rain', b'?']]>

3. Preprocessing the Input Features

Whatever the form of your data, all features must be converted to numerical values before they are fed to a neural network, and they usually also need to be normalized.
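For example, here is a minimal sketch of a custom Keras layer that standardizes its inputs; the Standardization class and its adapt() method are my own illustration, not a built-in Keras API:

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        # compute each feature's mean and standard deviation on a representative data sample
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        # standardize: subtract the mean and divide by the standard deviation
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

std_layer = Standardization()
std_layer.adapt(X_train)  # X_train was loaded earlier from the housing data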

3.1 Encoding Categorical Features Using One-Hot Vectors

In the California housing dataset, the ocean_proximity feature is categorical with five possible values: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY, and ISLAND. One option is to encode it as one-hot vectors.

Note: one-hot encoding is generally used when there are fewer than about 10 categories, while embeddings are usually preferred when there are more than about 50; between 10 and 50 categories you may want to try both approaches and keep whichever works best.
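A minimal sketch of one-hot encoding with a lookup table; the example categories and the number of out-of-vocabulary buckets are assumptions for illustration:

vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2  # extra buckets for categories not in the vocabulary
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)  # map each category to an integer index
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)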

3.2 Encoding Categorical Features Using Embeddings

An embedding is a trainable dense vector that represents a category. Word embeddings are one of the most common techniques in natural language processing, where pretrained embedding vectors are often reused.
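A minimal sketch of encoding the same categories with a trainable keras.layers.Embedding layer, reusing the lookup table from the previous sketch; the embedding dimension of 2 is arbitrary:

embedding_dim = 2  # arbitrary; larger values are typical for real tasks
embedding_layer = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                         output_dim=embedding_dim)

cat_indices = table.lookup(tf.constant(["NEAR BAY", "DESERT", "INLAND"]))
cat_embeddings = embedding_layer(cat_indices)  # shape (3, embedding_dim), trained along with the model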

3.3 Keras Preprocessing Layers

import tarfile
import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
fetch_housing_data()

HOUSING_PATH = os.path.join("datasets", "housing")
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

Output:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY
housing_median_age = tf.feature_column.numeric_column("housing_median_age")
housing_median_age

Output:

NumericColumn(key='housing_median_age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

4. TF Transform

TF Transform is a tool for end-to-end model productionization and is part of TensorFlow Extended (TFX).

TF Transform is not bundled with TensorFlow and must be installed separately.

try:
    import tensorflow_transform as tft

    def preprocess(inputs):  # inputs is a batch of input features
        median_age = inputs["housing_median_age"]
        ocean_proximity = inputs["ocean_proximity"]
        standardized_age = tft.scale_to_z_score(median_age - tft.mean(median_age))
        ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
        return {"standardized_median_age": standardized_age,
                "ocean_proximity_id": ocean_proximity_id }
except ImportError:
    print("TF Transform is not installed. Try running: pip3 install -U tensorflow-transform")

TF Transform can then apply this preprocess() function to the whole training set using Apache Beam. In the process it computes the required statistics over the full training set, in this example the mean and standard deviation; the components that perform these computations are called analyzers.

More importantly, TF Transform also generates an equivalent TensorFlow Function that you can plug into the model you deploy.

With the Data API, TFRecords, Keras preprocessing layers, and TF Transform, you can build highly scalable input pipelines for training and benefit from fast, portable data preprocessing in production.

5. The TensorFlow Datasets (TFDS) Project

TFDS makes it very easy to download common datasets, from small ones like MNIST and Fashion MNIST to huge ones like ImageNet.

The supported datasets include image datasets, text datasets (including translation datasets), and audio and video datasets. See https://www.tensorflow.org/datasets/datasets for the full list.

TFDS is not bundled with TensorFlow and must be installed separately; you can then download and load a dataset with tfds.load():

import tensorflow_datasets as tfds

print(tfds.list_builders())

Output:

['abstract_reasoning', 'aeslc', 'aflw2k3d', 'amazon_us_reviews', 'anli', 'arc', 'bair_robot_pushing_small', 'beans', 'big_patent', 'bigearthnet', 'billsum', 'binarized_mnist', 'binary_alpha_digits', 'blimp', 'c4', 'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cars196', 'cassava', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'cfq', 'chexpert', 'cifar10', 'cifar100', 'cifar10_1', 'cifar10_corrupted', 'citrus_leaves', 'cityscapes', 'civil_comments', 'clevr', 'cmaterdb', 'cnn_dailymail', 'coco', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'common_voice', 'cos_e', 'cosmos_qa', 'covid19sum', 'crema_d', 'curated_breast_imaging_ddsm', 'cycle_gan', 'deep_weeds', 'definite_pronoun_resolution', 'dementiabank', 'diabetic_retinopathy_detection', 'div2k', 'dmlab', 'downsampled_imagenet', 'dsprites', 'dtd', 'duke_ultrasound', 'emnist', 'eraser_multi_rc', 'esnli', 'eurosat', 'fashion_mnist', 'flic', 'flores', 'food101', 'forest_fires', 'gap', 'geirhos_conflict_stimuli', 'german_credit_numeric', 'gigaword', 'glue', 'groove', 'higgs', 'horses_or_humans', 'i_naturalist2017', 'imagenet2012', 'imagenet2012_corrupted', 'imagenet2012_subset', 'imagenet_resized', 'imagenette', 'imagewang', 'imdb_reviews', 'irc_disentanglement', 'iris', 'kitti', 'kmnist', 'lfw', 'librispeech', 'librispeech_lm', 'libritts', 'ljspeech', 'lm1b', 'lost_and_found', 'lsun', 'malaria', 'math_dataset', 'mctaco', 'mnist', 'mnist_corrupted', 'movie_rationales', 'moving_mnist', 'multi_news', 'multi_nli', 'multi_nli_mismatch', 'natural_questions', 'newsroom', 'nsynth', 'nyu_depth_v2', 'omniglot', 'open_images_challenge2019_detection', 'open_images_v4', 'openbookqa', 'opinion_abstracts', 'opinosis', 'opus', 'oxford_flowers102', 'oxford_iiit_pet', 'para_crawl', 'patch_camelyon', 'pet_finder', 'pg19', 'places365_small', 'plant_leaves', 'plant_village', 'plantae_k', 'qa4mre', 'quickdraw_bitmap', 'reddit', 'reddit_disentanglement', 'reddit_tifu', 'resisc45', 'robonet', 'rock_paper_scissors', 'rock_you', 'samsum', 'savee', 'scan', 'scene_parse150', 'scicite', 'scientific_papers', 'shapes3d', 'smallnorb', 'snli', 'so2sat', 'speech_commands', 'squad', 'stanford_dogs', 'stanford_online_products', 'starcraft_video', 'stl10', 'sun397', 'super_glue', 'svhn_cropped', 'ted_hrlr_translate', 'ted_multi_translate', 'tedlium', 'tf_flowers', 'the300w_lp', 'tiny_shakespeare', 'titanic', 'trivia_qa', 'uc_merced', 'ucf101', 'vgg_face2', 'visual_domain_decathlon', 'voc', 'voxceleb', 'voxforge', 'waymo_open_dataset', 'web_questions', 'wider_face', 'wiki40b', 'wikihow', 'wikipedia', 'winogrande', 'wmt14_translate', 'wmt15_translate', 'wmt16_translate', 'wmt17_translate', 'wmt18_translate', 'wmt19_translate', 'wmt_t2t_translate', 'wmt_translate', 'wordnet', 'xnli', 'xsum', 'yelp_polarity_reviews']
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]

plt.figure(figsize=(6,3))
mnist_train = mnist_train.repeat(5).batch(32).prefetch(1)

for item in mnist_train:
    images = item["image"]
    labels = item["label"]
    for index in range(5):
        plt.subplot(1, 5, index + 1)
        image = images[index, ..., 0]
        label = labels[index].numpy()
        plt.imshow(image, cmap="binary")
        plt.title(label)
        plt.axis("off")
    break # just showing part of the first batch

Output: (the first five images of the batch are displayed with their labels)

datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
mnist_train = mnist_train.repeat(5).batch(32)
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
    print(images.shape)
    print(labels.numpy())

Output:

(32, 28, 28, 1)
[4 1 0 7 8 1 2 7 1 6 6 4 7 7 3 3 7 9 9 1 0 6 6 9 9 4 8 9 4 7 3 3]

You can also pass as_supervised=True to the load() function to get the data as (feature, label) pairs, and you can specify the batch size as well:

datasets = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = datasets["train"].repeat().prefetch(1)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),
    keras.layers.Lambda(lambda images: tf.cast(images, tf.float32)),
    keras.layers.Dense(10, activation="softmax")])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(lr=1e-3),
              metrics=["accuracy"])

model.fit(mnist_train, steps_per_epoch=60000 // 32, epochs=5)

Output:

Epoch 1/5
1875/1875 [==============================] - 5s 3ms/step - loss: 32.2213 - accuracy: 0.8415
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 26.2503 - accuracy: 0.8679
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 24.9240 - accuracy: 0.8737
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 24.6774 - accuracy: 0.8753
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 24.2004 - accuracy: 0.8763

<tensorflow.python.keras.callbacks.History at 0x233fa8d0>

6. TensorFlow Hub

import tensorflow_hub as hub

#hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
#                           output_shape=[50], input_shape=[], dtype=tf.string)

hub_layer = hub.KerasLayer("https://hub.tensorflow.google.cn/google/tf2-preview/nnlm-en-dim50/1",
                           output_shape=[50], input_shape=[], dtype=tf.string)

model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.summary()

Output:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
keras_layer (KerasLayer)     (None, 50)                48190600  
_________________________________________________________________
dense_1 (Dense)              (None, 16)                816       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 48,191,433
Trainable params: 833
Non-trainable params: 48,190,600
_________________________________________________________________
sentences = tf.constant(["It was a great movie", "The actors were amazing"])
embeddings = hub_layer(sentences)

embeddings

Output:

<tf.Tensor: shape=(2, 50), dtype=float32, numpy=
array([[ 7.45939985e-02,  2.76720114e-02,  9.38646123e-02,
         1.25124469e-01,  5.40293928e-04, -1.09435350e-01,
         1.34755149e-01, -9.57818255e-02, -1.85177118e-01,
        -1.69703495e-02,  1.75612606e-02, -9.06603858e-02,
         1.12110220e-01,  1.04646273e-01,  3.87700424e-02,
        -7.71859884e-02, -3.12189370e-01,  6.99466765e-02,
        -4.88970093e-02, -2.99049795e-01,  1.31183028e-01,
        -2.12630898e-01,  6.96169436e-02,  1.63592950e-01,
         1.05169769e-02,  7.79720694e-02, -2.55230188e-01,
        -1.80790052e-01,  2.93739915e-01,  1.62875261e-02,
        -2.80566931e-01,  1.60284728e-01,  9.87277832e-03,
         8.44555616e-04,  8.39456245e-02,  3.24002892e-01,
         1.53253034e-01, -3.01048346e-02,  8.94618109e-02,
        -2.39153411e-02, -1.50188789e-01, -1.81733668e-02,
        -1.20483577e-01,  1.32937476e-01, -3.35325629e-01,
        -1.46504581e-01, -1.25251599e-02, -1.64428815e-01,
        -7.00765476e-02,  3.60923223e-02],
       [-1.56998575e-01,  4.24599349e-02, -5.57703003e-02,
        -8.08446854e-03,  1.23733155e-01,  3.89427543e-02,
        -4.37901802e-02, -1.86987907e-01, -2.29341656e-01,
        -1.27766818e-01,  3.83025259e-02, -1.07057482e-01,
        -6.11584112e-02,  2.49654502e-01, -1.39712945e-01,
        -3.91289443e-02, -1.35873526e-01, -3.58613044e-01,
         2.53462754e-02, -1.58370987e-01, -1.38350084e-01,
        -3.90771806e-01, -6.63642734e-02, -3.24838236e-02,
        -2.20453963e-02, -1.68282315e-01, -7.40613639e-02,
        -2.49074101e-02,  2.46460736e-01,  9.87201929e-05,
        -1.85390845e-01, -4.92824614e-02,  1.09015472e-01,
        -9.54203904e-02, -1.60352528e-01, -2.59811729e-02,
         1.13778859e-01, -2.09578887e-01,  2.18261331e-01,
        -3.11211571e-02, -6.12562597e-02, -8.66057724e-02,
        -1.10762455e-01, -5.73977083e-03, -1.08923554e-01,
        -1.72919363e-01,  1.00515485e-01, -5.64153939e-02,
        -4.97694984e-02, -1.07776590e-01]], dtype=float32)>

 

 
