TensorFlow Dataset Basics: TFRecord

zhao_crystal

Article link: https://blog.csdn.net/zhao_crystal/article/details/122382147

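This article introduces the TFRecord format in TensorFlow: how to generate and read TFRecord files, and how to handle compressed TFRecord files. It walks through generating TFRecord files from CSV data and shows how to read them back into a dataset. It also demonstrates how to train and evaluate a model on TFRecord data.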

TFRecord is a file format specific to TensorFlow, and TensorFlow's input pipeline is built to work well with it. Its main advantage is fast data reading.

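1. Introduction to TFRecord

A TFRecord file stores a sequence of serialized tf.train.Example records, which are nested as follows:

-> tf.train.Example
-> tf.train.Features -> {"key": tf.train.Feature}
-> tf.train.Feature -> tf.train.BytesList / FloatList / Int64List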

import tensorflow as tf

favorite_books = [name.encode('utf-8') 
                 for name in ["machine learning", "cc150"]]

# BytesList
favorite_books_bytelist = tf.train.BytesList(value = favorite_books)
print(favorite_books_bytelist)

# FloatList
hours_floatlist = tf.train.FloatList(value = [15.5, 9.5, 7.0, 8.0])
print(hours_floatlist)

# Int64List
age_int64list = tf.train.Int64List(value = [42])
print(age_int64list)

# tf.train.Features: a dictionary mapping each key to a tf.train.Feature
features = tf.train.Features(
    feature = {
        "favorite_books": tf.train.Feature(
            bytes_list = favorite_books_bytelist),
        "hours": tf.train.Feature(
            float_list = hours_floatlist),
        "age": tf.train.Feature(int64_list = age_int64list)
    }
)

print(features)

# tf.train.Example
# An Example is a mostly-normalized data format for storing data for training and inference.
example = tf.train.Example(features=features)
print(example)

Serializing the Example turns it into a compact binary string, which reduces the storage space it needs.

serialized_example = example.SerializeToString()
print(serialized_example)
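To illustrate that serialization is reversible, here is a minimal sketch that serializes a small Example and parses the bytes back with the protocol buffer method FromString. The names demo_example, demo_bytes, and restored are illustrative and do not appear in the original code.

```python
import tensorflow as tf

# Build a small Example with one int64 feature (hypothetical demo data).
demo_example = tf.train.Example(
    features=tf.train.Features(
        feature={"age": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[42]))}))

# Serialize to bytes, then parse the bytes back into a new Example.
demo_bytes = demo_example.SerializeToString()
restored = tf.train.Example.FromString(demo_bytes)

restored_age = restored.features.feature["age"].int64_list.value[0]
print(type(demo_bytes), restored_age)
```

The restored message compares equal to the one that was serialized, so nothing is lost in the round trip.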

2. TFRecord files

2.1 Generating a TFRecord file

import os
output_dir = '/content/drive/MyDrive/data/tfrecord_data'
# Create the output directory (and any missing parents) if it does not exist
os.makedirs(output_dir, exist_ok=True)

filename = "test.tfrecords"
filename_fullpath = os.path.join(output_dir, filename)
# Write the same serialized example three times
with tf.io.TFRecordWriter(filename_fullpath) as writer:
    for i in range(3):
        writer.write(serialized_example)

2.2 Reading a TFRecord file

dataset = tf.data.TFRecordDataset([filename_fullpath])
for serialized_example_tensor in dataset:
    print(serialized_example_tensor)

Parse each serialized example back into its original fields

# Define a dictionary that gives the type of each feature
excepted_features = {
    "favorite_books": tf.io.VarLenFeature(dtype = tf.string),
    "hours": tf.io.VarLenFeature(dtype = tf.float32),
    "age": tf.io.FixedLenFeature([], dtype = tf.int64)
}

dataset = tf.data.TFRecordDataset([filename_fullpath])
for serialized_example_tensor in dataset:
    example = tf.io.parse_single_example(
        serialized_example_tensor,
        excepted_features)
    print(example)

As the output above shows, the VarLenFeature fields are parsed into SparseTensor objects. A SparseTensor is an efficient way to store a sparse matrix.
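As a small standalone illustration of what a SparseTensor holds, the sketch below builds one by hand and converts it to a dense tensor. The values are made up for the example; only the positions and values of the stored entries are kept, plus the overall shape.

```python
import tensorflow as tf

# A 2x3 matrix with only two stored entries (illustrative values).
sparse = tf.sparse.SparseTensor(
    indices=[[0, 1], [1, 2]],   # positions of the stored values
    values=[1.5, 2.5],          # the stored values themselves
    dense_shape=[2, 3])         # shape of the full matrix

# Fill every position that is not stored with the default value 0.
dense = tf.sparse.to_dense(sparse)
print(dense.numpy())
```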

Converting the SparseTensor to a dense tensor

for serialized_example_tensor in dataset:
    example = tf.io.parse_single_example(
        serialized_example_tensor,
        excepted_features)
    books = tf.sparse.to_dense(example["favorite_books"],
                               default_value=b"")
    for book in books:
        print(book.numpy().decode("utf-8"))

3. Compressed TFRecord files

3.1 Saving TFRecords as a compressed file

filename_zip = "test.tfrecords.zip"
filename_fullpath_zip = os.path.join(output_dir, filename_zip)
options = tf.io.TFRecordOptions(compression_type = "GZIP")
with tf.io.TFRecordWriter(filename_fullpath_zip, options) as writer:
    for i in range(3):
        writer.write(serialized_example)

Reading the compressed file

dataset_zip= tf.data.TFRecordDataset([filename_fullpath_zip], 
                                     compression_type="GZIP")
for serialized_example_tensor in dataset_zip:
    example = tf.io.parse_single_example(
        serialized_example_tensor,
        excepted_features)
    books = tf.sparse.to_dense(example["favorite_books"],
                               default_value=b"")
    for book in books:
        print(book.numpy().decode("utf-8"))
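To get a feel for the effect of compression, the following sketch writes the same records once uncompressed and once with GZIP into a temporary directory and compares the file sizes. The record contents, file names, and record count are assumptions chosen for the demo; how much space is saved depends heavily on the data.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical demo record: one float feature with highly repetitive values.
demo_record = tf.train.Example(
    features=tf.train.Features(
        feature={"hours": tf.train.Feature(
            float_list=tf.train.FloatList(value=[1.0] * 100))})
).SerializeToString()

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "demo.tfrecords")
gzip_path = os.path.join(tmp_dir, "demo.tfrecords.gz")

# Uncompressed file
with tf.io.TFRecordWriter(plain_path) as writer:
    for _ in range(100):
        writer.write(demo_record)

# GZIP-compressed file
gzip_options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(gzip_path, gzip_options) as writer:
    for _ in range(100):
        writer.write(demo_record)

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print(plain_size, gzip_size)

# The reader has to be told the compression type explicitly.
n_read = sum(1 for _ in tf.data.TFRecordDataset(
    [gzip_path], compression_type="GZIP"))
print(n_read)
```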

4. TFRecord in practice

Function: read a dataset from CSV files

import numpy as np
import functools

def parse_csv_line(line, n_fields):
    defs = [tf.constant(np.nan)] * n_fields
    parse_fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(parse_fields[0:-1])
    y = tf.stack(parse_fields[-1:])
    return x, y

# functools.partial fixes some of a function's arguments (alternatively, simply give n_fields a default value of 9 in parse_csv_line)
parse_csv_line_9 = functools.partial(parse_csv_line, n_fields = 9)

def csv_reader_dataset(filenames, n_readers=5, batch_size=32, 
                       n_parse_threads=5, shuffle_buffer_size=10000):
    filename_dataset = tf.data.Dataset.list_files(filenames)
    filename_dataset = filename_dataset.repeat()
    dataset = filename_dataset.interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length = n_readers
    )

    # shuffle() returns a new dataset, so the result must be assigned back
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(parse_csv_line_9,
                          num_parallel_calls = n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset
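To make the parsing step concrete, here is a minimal sketch that applies the same tf.io.decode_csv logic to a single hand-written line. The three-field line is an assumption for the demo; the real files in this article have nine fields.

```python
import numpy as np
import tensorflow as tf

# One CSV line with 3 fields; the last field is treated as the label.
line = tf.constant("1.0,2.0,3.0")

# record_defaults sets both the dtype (float32) and the fallback value per field.
defaults = [tf.constant(np.nan)] * 3
fields = tf.io.decode_csv(line, record_defaults=defaults)

x = tf.stack(fields[:-1])   # features: every field except the last
y = tf.stack(fields[-1:])   # label: the last field, kept as a 1-element vector
print(x.numpy(), y.numpy())
```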

Function: convert one (x, y) sample from the dataset into a serialized tf.train.Example, so it can be written to a TFRecord file

def serialized_example(x, y):
    """converts x, y, to tf.train.Example and serialize"""
    input_features = tf.train.FloatList(value = x)
    label = tf.train.FloatList(value = y)

    features = tf.train.Features(
        feature = {
            "input_features": tf.train.Feature(float_list = input_features),
            "label": tf.train.Feature(float_list = label)
        }
    )

    example = tf.train.Example(features = features)
    return example.SerializeToString()

Function: write a CSV dataset to TFRecord files

def csv_dataset_to_tfrecords(base_filename_dir, dataset, n_shards, steps_per_shard, 
                             compression_type = None):
    """
    :param base_filename_dir: directory the shard files are written to
    :param dataset: csv dataset
    :param n_shards: number of files to split the dataset into
    :param steps_per_shard: number of batches to take from the dataset for each file.
                            The dataset was built with repeat(), so iterating over it
                            never ends and the step count has to be computed up front.
    :param compression_type: compression type, e.g. "GZIP"; None means no compression
    """
    os.makedirs(base_filename_dir, exist_ok=True)
    options = tf.io.TFRecordOptions(compression_type = compression_type)
    all_filenames = []
    for shard_id in range(n_shards):
        filename = '{:05d}-of-{:05d}'.format(shard_id, n_shards)
        filename_fullpath = os.path.join(base_filename_dir, filename)
        with tf.io.TFRecordWriter(filename_fullpath, options) as writer:
            # Note: each loop over dataset.take(...) starts a new iteration of the dataset
            for x_batch, y_batch in dataset.take(steps_per_shard):
                # Unpack the batch and write one record per sample
                for x_example, y_example in zip(x_batch, y_batch):
                    writer.write(serialized_example(x_example, y_example))
    
        all_filenames.append(filename_fullpath)
    return all_filenames
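One point worth knowing about the sharding loop: iterating over dataset.take(n) in a Python for loop starts from the beginning of the dataset every time. If the dataset has a deterministic order, skip() can be combined with take() so that each shard covers a different slice. The sketch below shows this on a toy dataset; it is an illustration under that assumption, not the approach used in this article, and with shuffling enabled it would not by itself guarantee disjoint shards.

```python
import tensorflow as tf

# A deterministic, endlessly repeating toy dataset: 0, 1, ..., 9, 0, 1, ...
toy = tf.data.Dataset.range(10).repeat()

n_shards, steps_per_shard = 2, 5
shards = []
for shard_id in range(n_shards):
    # skip() moves past the elements that earlier shards already covered.
    part = toy.skip(shard_id * steps_per_shard).take(steps_per_shard)
    shards.append([int(v) for v in part])

print(shards)
```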

In practice: (1) read the data from the CSV files to obtain a dataset

(2) write the dataset to TFRecord files and return the TFRecord file names

n_shards = 20
batch_size = 32
train_steps_per_shard = 11610 // batch_size // n_shards  # 11610 training samples
valid_steps_per_shard = 3870 // batch_size // n_shards   # 3870 validation samples
test_steps_per_shard = 5160 // batch_size // n_shards    # 5160 test samples

output_dir = "/content/drive/MyDrive/data/generate_tfrecords"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

train_base_dir = os.path.join(output_dir, "train")
valid_base_dir = os.path.join(output_dir, "valid")
test_base_dir = os.path.join(output_dir, "test")

csv_dir = "/content/drive/MyDrive/data/generate_csv"

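# Get the list of CSV files for the training, validation, or test set.
# Assumes base_dir contains one subdirectory per split (train, valid, test).
def get_filenames(base_dir, prefix):
    filenames_dir = os.path.join(base_dir, prefix)
    filenames_list = os.listdir(filenames_dir)
    return [os.path.join(filenames_dir, e) for e in filenames_list]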


train_csv_filenames = get_filenames(csv_dir, "train")
valid_csv_filenames = get_filenames(csv_dir, "valid")
test_csv_filenames = get_filenames(csv_dir, "test")


train_set = csv_reader_dataset(train_csv_filenames, n_readers=5, batch_size=32, 
    n_parse_threads=5, shuffle_buffer_size=10000)
valid_set = csv_reader_dataset(valid_csv_filenames, n_readers=5, batch_size=32, 
    n_parse_threads=5, shuffle_buffer_size=10000)
test_set = csv_reader_dataset(test_csv_filenames, n_readers=5, batch_size=32, 
    n_parse_threads=5, shuffle_buffer_size=10000)


train_tfrecord_filenames = csv_dataset_to_tfrecords(
    train_base_dir, train_set, n_shards, train_steps_per_shard, None)
valid_tfrecord_filenames = csv_dataset_to_tfrecords(
    valid_base_dir, valid_set, n_shards, valid_steps_per_shard, None)
test_tfrecord_filenames = csv_dataset_to_tfrecords(
    test_base_dir, test_set, n_shards, test_steps_per_shard, None)
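The per-shard step counts come from two integer divisions, and both round down, so a few samples are left out of the files. A worked example for the training set, using the 11610 samples, batch size 32, and 20 shards from the code above:

```python
# Sample count and settings taken from the training-set configuration.
n_samples = 11610
batch_size = 32
n_shards = 20

batches_total = n_samples // batch_size          # full batches available
steps_per_shard = batches_total // n_shards      # batches written per shard
samples_written = steps_per_shard * batch_size * n_shards

print(batches_total, steps_per_shard, samples_written)
```

Each shard therefore receives 18 batches, and 11520 samples are written in total.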

Read the TFRecord files to obtain a dataset

expected_features= {
    "input_features": tf.io.FixedLenFeature([8], dtype=tf.float32),
    "label": tf.io.FixedLenFeature([1], dtype=tf.float32)
}

# Parse a serialized example into (input_features, label)
def parse_example(serialized_example):
    example = tf.io.parse_single_example(serialized_example, expected_features)

    return example["input_features"], example["label"]

def tfrecords_reader_dataset(filenames, n_readers=5, batch_size=32, 
                             n_parse_threads=5, shuffle_buffer_size=10000):
    dataset = tf.data.Dataset.list_files(filenames)
    dataset = dataset.repeat()
    dataset = dataset.interleave(
        lambda filename: tf.data.TFRecordDataset(filename, 
                                                 compression_type=None),
        cycle_length = n_readers
    )
    # shuffle() returns a new dataset, so the result must be assigned back
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(parse_example, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset

A simple example to check that the stored data is correct

tfrecords_train = tfrecords_reader_dataset(train_tfrecord_filenames,
                                           batch_size=3)
for x_batch, y_batch in tfrecords_train.take(2):
    print(x_batch)
    print(y_batch)
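Printing batches only allows a visual check. A stricter option is a round-trip test: write a known sample, read it back, and compare. The sketch below does this in a temporary directory, with made-up values that match the feature shapes used in this article ([8] inputs, [1] label); the names x_in, y_in, spec, and the file name are illustrative.

```python
import os
import tempfile
import numpy as np
import tensorflow as tf

# Hypothetical sample: 8 input features and 1 label.
x_in = np.arange(8, dtype=np.float32)
y_in = np.array([0.5], dtype=np.float32)

demo = tf.train.Example(features=tf.train.Features(feature={
    "input_features": tf.train.Feature(
        float_list=tf.train.FloatList(value=x_in.tolist())),
    "label": tf.train.Feature(
        float_list=tf.train.FloatList(value=y_in.tolist())),
}))

# Write the single record to a temporary TFRecord file.
path = os.path.join(tempfile.mkdtemp(), "check.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(demo.SerializeToString())

# Same feature spec shape as the one used for reading in this article.
spec = {
    "input_features": tf.io.FixedLenFeature([8], dtype=tf.float32),
    "label": tf.io.FixedLenFeature([1], dtype=tf.float32),
}
for raw in tf.data.TFRecordDataset([path]):
    parsed = tf.io.parse_single_example(raw, spec)
    x_out = parsed["input_features"].numpy()
    y_out = parsed["label"].numpy()

print(np.array_equal(x_in, x_out), np.array_equal(y_in, y_out))
```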

In practice: build the training, validation, and test sets from the data in the TFRecord files

batch_size= 32
tfrecords_train_set = tfrecords_reader_dataset(train_tfrecord_filenames, 
                                               batch_size=batch_size)
tfrecords_valid_set = tfrecords_reader_dataset(valid_tfrecord_filenames, 
                                               batch_size=batch_size)
tfrecords_test_set = tfrecords_reader_dataset(test_tfrecord_filenames, 
                                               batch_size=batch_size)

Train the model and evaluate it

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Dense(30, activation='relu',
                       input_shape=[8]),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1),
])

# This is a regression model, so track mean absolute error rather than accuracy
model.compile(loss=keras.losses.MeanSquaredError(), 
              optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["mae"])

callbacks = [keras.callbacks.EarlyStopping(
    patience=5, min_delta=1e-2)]

# Train on the datasets built from the TFRecord files
history = model.fit(tfrecords_train_set,
                    validation_data = tfrecords_valid_set,
                    steps_per_epoch = 11610 // batch_size, # 11610 is the number of training samples
                    validation_steps = 3870 // batch_size, # 3870 is the number of validation samples
                    epochs = 100,
                    callbacks = callbacks)

model.evaluate(tfrecords_test_set, steps = 5160 // batch_size) # 5160 is the total number of test samples
