Loading and Processing Datasets in TensorFlow: the TFRecord Format

流黄蛋

Article link: https://blog.csdn.net/Amazing_Fly/article/details/125172733

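The TFRecord format is worth knowing well: it is TensorFlow's preferred format for storing large amounts of data and reading it back efficiently.
Characteristics: it is a simple binary format that contains nothing but a sequence of binary records of varying sizes. Each record consists of a length, a CRC checksum used to check that the length is not corrupted, the actual data, and finally a CRC checksum for the data.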
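
To make the record layout concrete, here is a minimal pure-Python sketch that writes and reads records in that layout. The article does not give the field widths, so the details are assumptions based on my understanding of TensorFlow's implementation: an 8-byte little-endian length, 4-byte checksums, CRC32C as the checksum algorithm, and the masking constant 0xA282EAD8. The function names are hypothetical. In practice, use tf.io.TFRecordWriter rather than code like this.

```python
import io
import struct


def _make_crc32c_table():
    # Lookup table for CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _make_crc32c_table()


def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    # Assumed masking step: rotate right by 15 bits, then add a constant.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, data):
    header = struct.pack("<Q", len(data))                    # length
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc32c(header)))   # CRC of the length
    stream.write(data)                                       # actual data
    stream.write(struct.pack("<I", masked_crc32c(data)))     # CRC of the data


def read_records(stream):
    while True:
        header = stream.read(8)
        if not header:
            return
        (length_crc,) = struct.unpack("<I", stream.read(4))
        if masked_crc32c(header) != length_crc:
            raise ValueError("corrupted length")
        (length,) = struct.unpack("<Q", header)
        data = stream.read(length)
        (data_crc,) = struct.unpack("<I", stream.read(4))
        if masked_crc32c(data) != data_crc:
            raise ValueError("corrupted data")
        yield data


buf = io.BytesIO()
write_record(buf, b"hello")
write_record(buf, b"word")
buf.seek(0)
records = list(read_records(buf))
print(records)
```

Each record therefore costs 16 bytes of overhead on top of its payload, which is why the format stays cheap even for many small records.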

Creating a TFRecord file

Use tf.io.TFRecordWriter:

import tensorflow as tf

with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"hello")
    f.write(b"word")

Then load the file with tf.data.TFRecordDataset:

tf_file = tf.data.TFRecordDataset(["my_data.tfrecord"])
for item in tf_file:
    print(item)

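Note: by default, TFRecordDataset reads the files one after another. Setting the num_parallel_reads argument makes it read several files in parallel and interleave their records. The same result can also be obtained with list_files() and interleave().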
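
The two options in the note can be sketched as follows. The shard file names, the number of shards, and the cycle_length value are made up for illustration. The order in which records come out of a parallel read is not something to rely on, so the sketch only compares the sorted contents.

```python
import os
import tempfile

import tensorflow as tf

# Write three small shard files (hypothetical names) to a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"part_{i}.tfrecord")
    with tf.io.TFRecordWriter(path) as f:
        for j in range(2):
            f.write(f"file{i}-record{j}".encode())
    paths.append(path)

# Option 1: let TFRecordDataset read several files in parallel.
parallel_ds = tf.data.TFRecordDataset(paths, num_parallel_reads=3)

# Option 2: list the files, then interleave one TFRecordDataset per file.
file_ds = tf.data.Dataset.list_files(
    os.path.join(tmp_dir, "part_*.tfrecord"), shuffle=False)
interleaved_ds = file_ds.interleave(
    lambda p: tf.data.TFRecordDataset(p),
    cycle_length=3,
    num_parallel_calls=tf.data.AUTOTUNE)

parallel_records = sorted(r.numpy() for r in parallel_ds)
interleaved_records = sorted(r.numpy() for r in interleaved_ds)
print(parallel_records == interleaved_records)
```

The second form is more verbose but gives a place to add per-file steps, such as skipping a header record, before the records are mixed.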

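Creating a compressed TFRecord file

Compressing a TFRecord file is useful when the dataset has to be loaded over a network, because less data needs to be transferred.
For example, a compressed TFRecord file can be created like this: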

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    # perform the writes here, for example:
    f.write(b"hello")

Note that a file written with compression must be read with the matching compression type:

dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")
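
As a rough check of the benefit, the sketch below writes the same records with and without GZIP and compares the file sizes. The file names and the highly repetitive payload are chosen for illustration only; real data will compress far less dramatically.

```python
import os
import tempfile

import tensorflow as tf

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.tfrecord")
gzip_path = os.path.join(tmp_dir, "compressed.tfrecord")
payload = b"x" * 1000  # repetitive on purpose, so it compresses well

# Same records, written once uncompressed and once with GZIP.
with tf.io.TFRecordWriter(plain_path) as f:
    for _ in range(100):
        f.write(payload)

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(gzip_path, options) as f:
    for _ in range(100):
        f.write(payload)

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print(plain_size, gzip_size)

# Reading back requires the same compression type.
dataset = tf.data.TFRecordDataset([gzip_path], compression_type="GZIP")
restored = [r.numpy() for r in dataset]
```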

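Protocol buffers

TFRecord files usually contain serialized protocol buffers (also known as protobufs). This is a portable, extensible, and efficient binary format. Protobufs are defined in a simple language that looks like this: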

syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

The code below shows how the Person protobuf is used:

from person_pb2 import Person  # module generated by protoc from the definition above

person = Person(name="A1",
                id=123,
                email=["a@b.com"])  # create a Person
# Each field can be read with "."
# For example:
person.name
person.id
person.email

# email is a repeated field, so it supports list-style operations such as append, remove, and insert
person.email.append("c@d.com")
# SerializeToString() serializes person into binary data that can then be saved or sent over a network
s = person.SerializeToString()  # s now holds the serialized bytes
# To read the serialized data back, use ParseFromString()
person2 = Person()  # create an empty Person
person2.ParseFromString(s)  # deserialize s; person2 is a separate object with the same field values as person
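
The round trip above needs a person_pb2 module compiled from the .proto definition. As a self-contained sketch, the same two methods can be tried on tf.train.BytesList, a protobuf message class that ships with TensorFlow. It is swapped in here purely for convenience and is not the Person message.

```python
import tensorflow as tf

# BytesList has a single repeated bytes field named "value".
original = tf.train.BytesList(value=[b"a@b.com"])
original.value.append(b"c@d.com")  # repeated fields behave like lists

data = original.SerializeToString()  # plain Python bytes

restored = tf.train.BytesList()      # empty message
restored.ParseFromString(data)       # fill it from the serialized bytes
same_after_parse = (restored == original)

# The parsed message is a separate object: changing it leaves the original alone.
restored.value.append(b"e@f.com")
print(len(original.value), len(restored.value))
```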

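Protocol buffers are covered here only as background. TensorFlow mainly uses the Example protobuf, which is introduced next.

TensorFlow protobufs

The main protobuf used in TFRecord files is the Example protobuf, which represents one instance in a dataset.
Its protobuf definition is: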

syntax = "proto3";
message BytesList { repeated bytes value = 1; }
// [packed = true] is used on repeated numeric fields for more efficient encoding
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };

The code below creates a tf.train.Example that represents the same person as before, which is then written to a TFRecord file:

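# Aliases for the protobuf classes exposed under tf.train
BytesList = tf.train.BytesList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example

person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com",
                                                          b"c@d.com"]))
        }))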

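Note: a BytesList can hold any binary data, including any serialized object.
For example, tf.io.encode_jpeg() encodes an image as JPEG, and the resulting binary data can be put in a BytesList. After the Example has been parsed, tf.io.decode_image() decodes any BMP, GIF, JPEG, or PNG image. A tensor can also be serialized with tf.io.serialize_tensor(), stored in a BytesList, and later restored with tf.io.parse_tensor().
Next, write the example above to a TFRecord file: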

with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())
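
The note on BytesList can be sketched as follows. The feature names "image" and "tensor", the 4x4 all-zero image, and the small float tensor are made up for illustration. To keep the sketch short, the stored bytes are read straight back from the Example object rather than from a file.

```python
import tensorflow as tf

# A tiny dummy image and an arbitrary tensor (illustrative values only).
image = tf.zeros([4, 4, 3], dtype=tf.uint8)
tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])

jpeg_bytes = tf.io.encode_jpeg(image).numpy()           # JPEG-encoded bytes
tensor_bytes = tf.io.serialize_tensor(tensor).numpy()   # serialized tensor bytes

example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
    "tensor": tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensor_bytes])),
}))

# Read the raw bytes back out of the Example and decode them.
stored_image = example.features.feature["image"].bytes_list.value[0]
stored_tensor = example.features.feature["tensor"].bytes_list.value[0]

decoded_image = tf.io.decode_image(stored_image)
decoded_tensor = tf.io.parse_tensor(stored_tensor, out_type=tf.float32)
print(decoded_image.shape, decoded_tensor.shape)
```

Note that parse_tensor needs the original dtype passed as out_type, so the dtype has to be known, or stored alongside the data, when reading.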

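Once the data is saved, the next question is how to load and parse it.
The code below loads and parses the Example written above.
As before, first load the TFRecord file with tf.data.TFRecordDataset(), then parse each Example in it with tf.io.parse_single_example(). This is a TensorFlow operation, so it can be used inside a TF function. It takes at least two arguments: a string scalar tensor containing the serialized data, and a description of each feature. The description is a dictionary that maps each feature name either to a tf.io.FixedLenFeature descriptor, which gives the feature's shape, type, and default value, or to a tf.io.VarLenFeature descriptor, which gives only the type. The code below defines a description dictionary, then iterates over the TFRecordDataset and parses the serialized Example protobufs it contains: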

feature_description ={
    "name":tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id":tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}

# Parse the examples one at a time
tf_file = tf.data.TFRecordDataset(["my_contacts.tfrecord"])
for example in tf_file:
    # parse_single_example parses one example; tf.io.parse_example parses a whole batch at once
    parse_example = tf.io.parse_single_example(example, feature_description)
    dense_tensor = tf.sparse.to_dense(parse_example["emails"], default_value=b"")
    print(dense_tensor)

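Fixed-length features are parsed as regular tensors, while variable-length features are parsed as sparse tensors. A sparse tensor can be converted to a dense tensor with tf.sparse.to_dense().
The code above parses the examples one by one. Examples can also be parsed one batch at a time with tf.io.parse_example(), as shown here:

# Parse the examples in batches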
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(2)
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(serialized_examples, feature_description)
    print(parsed_examples)
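
Because parsing is a TensorFlow operation, it can also run inside the input pipeline through Dataset.map(). The sketch below is one way to do that. The helper names make_example and parse, the temporary file name, and the second contact ("Bob") are hypothetical additions for illustration.

```python
import os
import tempfile

import tensorflow as tf


def make_example(name, person_id, emails):
    # Build an Example with the same three features used in this article.
    return tf.train.Example(features=tf.train.Features(feature={
        "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
        "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[person_id])),
        "emails": tf.train.Feature(bytes_list=tf.train.BytesList(value=emails)),
    }))


path = os.path.join(tempfile.mkdtemp(), "contacts_demo.tfrecord")
with tf.io.TFRecordWriter(path) as f:
    f.write(make_example(b"Alice", 123, [b"a@b.com", b"c@d.com"]).SerializeToString())
    f.write(make_example(b"Bob", 456, [b"bob@e.com"]).SerializeToString())

feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}


def parse(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_description)
    # Convert the sparse "emails" feature to a dense 1-D string tensor.
    parsed["emails"] = tf.sparse.to_dense(parsed["emails"], default_value="")
    return parsed


dataset = tf.data.TFRecordDataset([path]).map(parse)
records = [
    (r["name"].numpy(), int(r["id"].numpy()), list(r["emails"].numpy()))
    for r in dataset
]
print(records)
```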

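Handling lists of lists with the SequenceExample protobuf

The Example protobuf covers most use cases, but it is awkward for multi-level nested lists. Text documents are one example: each document is a list of sentences and each sentence is a list of words, so a set of documents forms a three-level nested list:
[[[], [], [], …, []],
 [[], [], [], …, []],
 [[], [], [], …, []],
 …,
 [[], [], [], …, []]]
The SequenceExample protobuf handles this kind of dataset conveniently.
Its definition is: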

message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;  // contextual Features object
    FeatureLists feature_lists = 2;
};
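
A minimal sketch of building and parsing a SequenceExample follows. The feature names "author_id" and "content", the helper words_to_feature, and the sample sentences are made up for illustration. Parsing uses tf.io.parse_single_sequence_example(), which returns the context features and the feature lists separately. The variable-length lists come back as sparse tensors, converted here to a ragged tensor.

```python
import tensorflow as tf


def words_to_feature(words):
    # One sentence (a list of words) becomes one Feature holding a BytesList.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=words))


# One document: a list of sentences, each a list of words.
content = [[b"When", b"shall", b"we", b"meet"], [b"again"]]

sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(
            feature=[words_to_feature(sentence) for sentence in content]),
    }))
serialized = sequence_example.SerializeToString()

context_description = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
}
sequence_description = {
    "content": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_lists = tf.io.parse_single_sequence_example(
    serialized, context_description, sequence_description)

# Sentences have different lengths, so a ragged tensor is a natural fit.
ragged_content = tf.RaggedTensor.from_sparse(parsed_lists["content"])
print(ragged_content)
```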

That covers the storage protocols.