TFRecord-TensorFlow 数据集存储格式

最新推荐文章于 2024-05-07 14:03:48 发布

AI-创造美好未来

最新推荐文章于 2024-05-07 14:03:48 发布

阅读量256

点赞数

分类专栏：人工智能 python算法 tensorflow 文章标签： tensorflow 深度学习机器学习

本文链接：https://blog.csdn.net/weixin_53226226/article/details/122637762

版权

人工智能同时被 3 个专栏收录

10 篇文章 0 订阅

订阅专栏

tensorflow

7 篇文章 0 订阅

订阅专栏

python算法

4 篇文章 0 订阅

订阅专栏

TFRecord 是 TensorFlow 中的数据集存储格式。

TensorFlow 就可以高效地读取和处理这些数据集，从而帮助我们更高效地进行大规模的模型训练。

格式：TFRecord 可以理解为一系列序列化的 tf.train.Example 元素所组成的列表文件;
而每一个 tf.train.Example 又由若干个 tf.train.Feature 的字典

[
    {   # example 1 (tf.train.Example)
        'feature_1': tf.train.Feature,
        ...
        'feature_k': tf.train.Feature
    },
    ...
    {   # example N (tf.train.Example)
        'feature_1': tf.train.Feature,
        ...
        'feature_k': tf.train.Feature
    }
]
# 字典结构如
feature = {
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # 图片是一个 Bytes 对象
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))  # 标签是一个 Int 对象
            }

保存TFRecord

为了将形式各样的数据集整理为 TFRecord 格式，我们需要对数据集中的每个元素进行以下步骤：

1、读取该数据元素到内存
2、将该元素转换为 tf.train.Example 对象（每一个 tf.train.Example 由若干个 tf.train.Feature 的字典组成，因此需要先建立 Feature 的字典）；
3、将该 tf.train.Example 对象序列化为字符串，并通过一个预先定义的 tf.io.TFRecordWriter 写入 TFRecord 文件。

读取 TFRecord 数据

则可按照以下步骤：

1、通过 tf.data.TFRecordDataset 读入原始的 TFRecord 文件（此时文件中的 tf.train.Example 对象尚未被反序列化），获得一个 tf.data.Dataset 数据集对象；
2、通过 Dataset.map 方法，对该数据集对象中的每一个序列化的 tf.train.Example 字符串执行 tf.io.parse_single_example 函数，从而实现反序列化。

读取 TFRecord 数据

则可按照以下步骤：

1、通过 tf.data.TFRecordDataset 读入原始的 TFRecord 文件（此时文件中的 tf.train.Example 对象尚未被反序列化），获得一个 tf.data.Dataset 数据集对象；
2、通过 Dataset.map 方法，对该数据集对象中的每一个序列化的 tf.train.Example 字符串执行 tf.io.parse_single_example 函数，从而实现反序列化。

import os
import tensorflow as tf
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

train_cats_dir = './train/cats/'
train_dogs_dir = './train/dogs/'
tfrecord_file = './train.tfrecords'

train_cat_filenames = [train_cats_dir + filename for filename in os.listdir(train_cats_dir)]
train_dog_filenames = [train_dogs_dir + filename for filename in os.listdir(train_dogs_dir)]
train_filenames = train_cat_filenames + train_dog_filenames
train_labels = [0] * len(train_cat_filenames) + [1] * len(train_dog_filenames)

迭代读取每张图片，建立 tf.train.Feature 字典和 tf.train.Example 对象，序列化并写入 TFRecord 文件。

    with tf.io.TFRecordWriter(tfrecord_file) as writer:
        for filename, label in zip(train_filenames, train_labels):
            # 1、读取数据集图片到内存，image 为一个 Byte 类型的字符串
            image = open(filename, 'rb').read()
            # 2、建立 tf.train.Feature 字典
            feature = {
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # 图片是一个 Bytes 对象
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))  # 标签是一个 Int 对象
            }
            # 3、通过字典建立 Example
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            # 4\将Example序列化并写入 TFRecord 文件
            writer.write(example.SerializeToString())

读取 TFRecord 文件

我们可以通过以下代码，读取之间建立的 train.tfrecords 文件，并通过 Dataset.map 方法，使用 tf.io.parse_single_example 函数对数据集中的每一个序列化的 tf.train.Example 对象解码。

# 1、读取 TFRecord 文件
    raw_dataset = tf.data.TFRecordDataset(tfrecord_file)

    # 2、定义Feature结构，告诉解码器每个Feature的类型是什么
    feature_description = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }

    # 3、将 TFRecord 文件中的每一个序列化的 tf.train.Example 解码
    def _parse_example(example_string):
        feature_dict = tf.io.parse_single_example(example_string, feature_description)
        feature_dict['image'] = tf.io.decode_jpeg(feature_dict['image'])  # 解码JPEG图片
        return feature_dict['image'], feature_dict['label']

    dataset = raw_dataset.map(_parse_example)

    for image, label in dataset:
        print(image, label)