TensorFlow Data Pipeline Series: Comparing Ways to Parse the Example and SequenceExample Protocols with Dataset (Part 4)

Latest recommended article published on 2022-10-18 15:17:56

LoveMIss-Y

Category: TensorFlow. Tags: tensorflow, pytorch, deep learning, machine learning

Article link: https://blog.csdn.net/qq_27825451/article/details/105099454

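Preface: this article covers the differences between tf.train.Example and tf.train.SequenceExample in detail, with the focus on how each one is parsed. The earlier articles in the series are:

TensorFlow Data Pipeline Series Tutorial (Part 1): Attributes and Common Methods of the Dataset Class

TensorFlow Data Pipeline Series: Importing Data with Datasets + TFRecord (Part 2)

A Detailed Tutorial on TensorFlow tfrecords Files

TensorFlow Data Pipeline Series: Comparing the tf.train.Example and tf.train.SequenceExample Protocols (Part 3)

1. How to inspect the contents of a tfrecord file

1.1 A quick preview of a tfrecords file, using the example.tfrecord file produced in Part 3

Before parsing, it is worth checking that the saved tfrecords file matches expectations by previewing the features stored in each sample. The implementation differs between TensorFlow 1.x and TensorFlow 2.x.

(1) TensorFlow 1.x implementation

tf.train.Example.FromString() gives a quick look at a record: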

import tensorflow as tf
 
# Check the contents of the tfrecord file by reading its first record
ex = next(tf.python_io.tf_record_iterator('example.tfrecord'))
print(tf.train.Example.FromString(ex))

(2) TensorFlow 2.x implementation, done directly through a dataset

import tensorflow as tf

dataset = tf.data.TFRecordDataset("example.tfrecord")  # load the record file as a dataset
print(dataset.element_spec)  # inspect the element spec; prints TensorSpec(shape=(), dtype=tf.string, name=None)

for raw_record in dataset.take(1):  # one sample is enough for a preview
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

The result is:

features {
  feature {
    key: "X"
    value {
      float_list {
        value: 1.0
        value: 2.0
        value: 3.0
        value: 4.0
      }
    }
  }
  feature {
    key: "Y"
    value {
      int64_list {
        value: 1
      }
    }
  }
}

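The printed record clearly shows the data structure stored in the tfrecord file: a float_list feature X with four values and an int64_list feature Y with one value.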
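To follow along without Part 3 at hand, a file with the same layout can be regenerated. The sketch below is an assumption about how such a file could be written, not the author's original code: the four rows are copied from the parsing output shown in section 2.2, the helper names make_example and write_examples are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import tensorflow as tf

# Rows copied from the parsing output in section 2.2: (X values, Y label).
ROWS = [
    ([1.0, 2.0, 3.0, 4.0], 1),
    ([5.0, 6.0, 7.0, 8.0], 2),
    ([11.0, 12.0, 13.0, 14.0], 3),
    ([20.0, 21.0, 22.0, 23.0], 4),
]


def make_example(x_values, y_value):
    """Hypothetical helper: wrap one row in a tf.train.Example."""
    feature = {
        "X": tf.train.Feature(float_list=tf.train.FloatList(value=x_values)),
        "Y": tf.train.Feature(int64_list=tf.train.Int64List(value=[y_value])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_examples(path, rows):
    """Hypothetical helper: serialize every row and append it to a TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for x_values, y_value in rows:
            writer.write(make_example(x_values, y_value).SerializeToString())


# Written to a temporary directory here; use "example.tfrecord" to match the article.
path = os.path.join(tempfile.mkdtemp(), "example.tfrecord")
write_examples(path, ROWS)

# Count the serialized records to confirm the file was written as expected.
num_records = sum(1 for _ in tf.data.TFRecordDataset(path))
print(num_records)
```

Each element read back from the file is still a serialized string, which is why the parsing step in the next section is needed.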

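2. Parsing a tfrecord file

2.1 The idea behind parsing

A saved tfrecord file can be loaded directly as a dataset and then iterated much like a Python list. The catch is that every sample (every element of the dataset) is stored in serialized form, so plain iteration only yields serialized records, i.e. serialized_example strings. The actual data has to be parsed out of each serialized_example.

Parsing therefore still comes down to iterating over the dataset elements and applying map to each one. The argument to map is a parsing function that we write ourselves. In short, the core of parsing a tfrecord file is writing the parsing function.

The example.tfrecord file from above is used again to illustrate:

2.2 Straight to the code, using the example.tfrecord file from above

(1) Parsing one sample at a time: tf.io.parse_single_example

The parsing function is defined as follows: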

def parse_tfrecords(serialized_example):
    # Define the parsing rules; they must match the rules used when the data was saved
    features = {
            'X': tf.io.FixedLenFeature([4], dtype=tf.float32),
            'Y': tf.io.FixedLenFeature([1], dtype=tf.int64)
            }
    
    # Parse exactly one example per call
    example = tf.io.parse_single_example(serialized_example, features=features)
    
    return example['X'], example['Y']

Now read the whole dataset:

tfrecords_dataset = tf.data.TFRecordDataset("example.tfrecord")
tfrecords_dataset = tfrecords_dataset.map(parse_tfrecords)  # parse every sample
for feature, label in tfrecords_dataset:
    print(feature, label)
    print("----------------------------------------")

'''Output:
tf.Tensor([1. 2. 3. 4.], shape=(4,), dtype=float32) tf.Tensor([1], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor([5. 6. 7. 8.], shape=(4,), dtype=float32) tf.Tensor([2], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor([11. 12. 13. 14.], shape=(4,), dtype=float32) tf.Tensor([3], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor([20. 21. 22. 23.], shape=(4,), dtype=float32) tf.Tensor([4], shape=(1,), dtype=int64)
----------------------------------------
'''

(2) Parsing several samples at once: tf.io.parse_example()

The code:

def parse_tfrecords(serialized_example):
    # Define the parsing rules; they must match the rules used when the data was saved
    features = {
            'X': tf.io.FixedLenFeature([4], dtype=tf.float32),
            'Y': tf.io.FixedLenFeature([1], dtype=tf.int64)
            }
    
    # Parse a batch of examples in one call; this line is the only difference
    example = tf.io.parse_example(serialized_example, features=features)
    
    return example['X'], example['Y']

It is used as follows:

tfrecords_dataset = tf.data.TFRecordDataset("example.tfrecord")

# Batch parsing is combined with batch(): each call parses one batch of samples
tfrecords_dataset = tfrecords_dataset.repeat(3).batch(5)  # repeat the dataset 3 times, then batch
     
tfrecords_dataset = tfrecords_dataset.map(parse_tfrecords)  # parse one batch per call
    
for feature, label in tfrecords_dataset:
    print(feature, label)
    print("----------------------------------------")

'''Output:
tf.Tensor(
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [11. 12. 13. 14.]
 [20. 21. 22. 23.]
 [ 1.  2.  3.  4.]], shape=(5, 4), dtype=float32) tf.Tensor(
[[1]
 [2]
 [3]
 [4]
 [1]], shape=(5, 1), dtype=int64)
----------------------------------------
tf.Tensor(
[[ 5.  6.  7.  8.]
 [11. 12. 13. 14.]
 [20. 21. 22. 23.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]], shape=(5, 4), dtype=float32) tf.Tensor(
[[2]
 [3]
 [4]
 [1]
 [2]], shape=(5, 1), dtype=int64)
----------------------------------------
tf.Tensor(
[[11. 12. 13. 14.]
 [20. 21. 22. 23.]], shape=(2, 4), dtype=float32) tf.Tensor(
[[3]
 [4]], shape=(2, 1), dtype=int64)
----------------------------------------
'''

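Summary:

There are 4 samples, repeated 3 times, which gives 12 samples in total. Each call parses a batch of 5, so the last batch holds only the remaining 2 samples. The output matches this exactly.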
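The batch sizes in the summary can be checked with plain arithmetic. The helper below is a hypothetical illustration (batch_sizes is not a TensorFlow function); its drop_remainder flag mirrors the argument of the same name on Dataset.batch, which discards the final short batch.

```python
def batch_sizes(num_samples, repeats, batch_size, drop_remainder=False):
    """Return the size of every batch produced by repeat(repeats).batch(batch_size)."""
    total = num_samples * repeats
    full_batches, remainder = divmod(total, batch_size)
    sizes = [batch_size] * full_batches
    # The leftover samples form one last, shorter batch unless they are dropped.
    if remainder and not drop_remainder:
        sizes.append(remainder)
    return sizes


print(batch_sizes(4, 3, 5))                       # [5, 5, 2]
print(batch_sizes(4, 3, 5, drop_remainder=True))  # [5, 5]
```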

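2.3 The four parsing functions in TensorFlow 2.x

tf.io.parse_example(serialized, features, example_names=None, name=None)

tf.io.parse_single_example(serialized, features, example_names=None, name=None)

# The next two are for examples made up of sequences
tf.io.parse_sequence_example(serialized, context_features=None, sequence_features=None, example_names=None,name=None)

tf.io.parse_single_sequence_example(serialized, context_features=None, sequence_features=None, example_name=None,name=None)

What they have in common:

Parameter 1: serialized, a serialized sample, i.e. one serialized element of the dataset (or one batch of them for the functions without single in their name);
Parameter 2: features, the parsing dictionary. This is the most important part: it defines the rules for parsing the serialized data, and parsing fails if it does not match the data.

For sequence examples:

Parameter 1: serialized, same as above
Parameter 2: context_features, the rules for the fixed-length features of a sequence, typically something like the label
Parameter 3: sequence_features, the rules for the variable-length features. See the previous article for what these two mean in detail.

In addition, the functions with single in their name parse only one sample (one element) per call, so they cannot be applied to a dataset that has already been batched; batch after parsing instead.

The functions without single parse several samples per call, so the dataset can be batched first and one batch is parsed at a time.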
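The two orderings can be compared side by side. This is a minimal sketch that builds the serialized records in memory instead of reading example.tfrecord; make_example and the FEATURES constant are illustrative names, not part of the original article.

```python
import tensorflow as tf


def make_example(x_values, y_value):
    """Hypothetical helper: wrap one row in a tf.train.Example."""
    feature = {
        "X": tf.train.Feature(float_list=tf.train.FloatList(value=x_values)),
        "Y": tf.train.Feature(int64_list=tf.train.Int64List(value=[y_value])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


FEATURES = {
    "X": tf.io.FixedLenFeature([4], tf.float32),
    "Y": tf.io.FixedLenFeature([1], tf.int64),
}

rows = [
    ([1.0, 2.0, 3.0, 4.0], 1),
    ([5.0, 6.0, 7.0, 8.0], 2),
    ([11.0, 12.0, 13.0, 14.0], 3),
    ([20.0, 21.0, 22.0, 23.0], 4),
]
serialized = [make_example(x, y).SerializeToString() for x, y in rows]
raw_dataset = tf.data.Dataset.from_tensor_slices(serialized)

# Order A: parse each scalar string with the "single" function, then batch.
single_then_batch = raw_dataset.map(
    lambda s: tf.io.parse_single_example(s, FEATURES)).batch(2)

# Order B: batch the serialized strings first, then parse a whole batch per call.
batch_then_parse = raw_dataset.batch(2).map(
    lambda s: tf.io.parse_example(s, FEATURES))

shapes_a = [batch["X"].shape.as_list() for batch in single_then_batch]
shapes_b = [batch["X"].shape.as_list() for batch in batch_then_parse]
print(shapes_a, shapes_b)
```

Both orderings yield batches of the same shape; which one to use is mostly a question of whether the parsing function receives a scalar string or a vector of strings.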

2.4 Return values of the four parsing functions

Although the four functions are used in much the same way, their return values differ. For Example records, the single and batch versions are almost identical in code; the only difference is whether the input may be batched. Both return a single dictionary:

example_dict = tf.io.parse_example(serialized, features, example_names=None, name=None)

example_dict = tf.io.parse_single_example(serialized, features, example_names=None, name=None)

Note:

The return value is a dict that maps each feature key to its parsed value; just pick out the values you need.

For SequenceExample records, however, the single and batch versions return different things: one returns two dictionaries and the other returns three:

context_example_dict, sequence_example_dict  = tf.io.parse_single_sequence_example(serialized_example,context_features = context_features,sequence_features = sequence_features)


context_example_dict, sequence_example_dict, other_dict  = tf.io.parse_sequence_example(serialized_example,context_features = context_features,sequence_features = sequence_features)

Note:

The first two dictionaries mean the same thing in both cases:

context_example_dict: maps the keys of the fixed-size context features to their values
sequence_example_dict: maps the keys of the variable-size sequence features to their values

The third dictionary is returned only by batch parsing:

final dict contains the lengths of any dense feature_list features.
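A small in-memory sketch can make the third dictionary concrete. It assumes the sequence feature is declared as tf.io.FixedLenSequenceFeature, so that it counts as a dense feature_list feature; the make_sequence_example helper and the sample values are made up for illustration.

```python
import tensorflow as tf


def make_sequence_example(values, label):
    """Hypothetical helper: context 'Y' plus a feature list 'X' with one float per step."""
    steps = [
        tf.train.Feature(float_list=tf.train.FloatList(value=[v])) for v in values
    ]
    return tf.train.SequenceExample(
        context=tf.train.Features(feature={
            "Y": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }),
        feature_lists=tf.train.FeatureLists(feature_list={
            "X": tf.train.FeatureList(feature=steps),
        }),
    )


# A batch of three serialized sequences with lengths 1, 2 and 3.
serialized = tf.constant([
    make_sequence_example([1.0], 1).SerializeToString(),
    make_sequence_example([2.0, 2.0], 2).SerializeToString(),
    make_sequence_example([3.0, 3.0, 3.0], 3).SerializeToString(),
])

context, sequences, lengths = tf.io.parse_sequence_example(
    serialized,
    context_features={"Y": tf.io.FixedLenFeature([1], tf.int64)},
    # Declared as a dense sequence feature so that its length is reported.
    sequence_features={"X": tf.io.FixedLenSequenceFeature([1], tf.float32)},
)

print(sequences["X"].shape)    # padded to the longest sequence in the batch
print(lengths["X"].numpy())    # true length of each sequence before padding
```

The lengths are useful for masking out the zero padding later, for example when feeding the batch to a recurrent layer.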

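3. Writing the parsing function

A separate parsing function is not strictly required. Serialized elements are parsed through dataset.map(func), so a lambda expression works too. However, data that is worth saving as a tfrecord file is usually fairly complex and awkward to parse in a lambda, so writing a dedicated parsing function is the better choice. A general template follows.

3.1 The four-step rule for parsing functions

Follow the four-step rule when writing a parsing function.

(1) For regular Example records

def parse_tfrecords(serialized_example):
    # Step 1: take one parameter, which holds either one serialized sample or one batch of serialized samples drawn from the dataset

    # Step 2: define the parsing rules; they must match the rules used when the data was saved
    features = {
            'X': tf.io.FixedLenFeature([4], dtype=tf.float32),
            'Y': tf.io.FixedLenFeature([1], dtype=tf.int64)
            }
    
    # Step 3: parse one sample or one batch of samples
    example = tf.io.parse_example(serialized_example, features=features)
    
    # Step 4: return the parsed data
    return example['X'], example['Y']

(2) For irregular SequenceExample records

The same four steps apply. The only difference is that two sets of parsing rules have to be defined:

context_feature={}

sequence_feature={}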

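3.2 The core of the four steps: defining the parsing dictionary (two ways to parse data)

The core task is to define a dictionary whose keys match the keys used when the tfrecord file was written, and whose values describe how each feature is read back. There are two kinds of description: fixed-length and variable-length.

(1) Fixed-length feature parsing: tf.io.FixedLenFeature

It takes the following form:

features = {
            'key1': tf.io.FixedLenFeature([4], dtype=tf.float32),
            'key2': tf.io.FixedLenFeature([1], dtype=tf.int64)
            }

The signature is:

tf.io.FixedLenFeature(shape, dtype, default_value=None)

shape: the shape of the parsed value. Normally this is the original shape, but it can also act as a reshape as long as the number of elements is unchanged,

for example turning a vector of shape (3,) into (1,3). Note: if the feature was written with .tostring(), its shape is ().

dtype: the dtype of the stored data; it must be one of tf.float32, tf.int64 or tf.string.
default_value: the value to use when the feature is missing from an example.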
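The shape and default_value parameters can be seen in a short in-memory sketch. The key "W" is a hypothetical feature that was never written, included only to show default_value taking effect.

```python
import tensorflow as tf

# One serialized Example that stores four floats under "X" and nothing else.
example = tf.train.Example(features=tf.train.Features(feature={
    "X": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 2.0, 3.0, 4.0])),
}))
serialized = example.SerializeToString()

features = {
    # Four stored values read back as a 2x2 tensor: shape acts as a reshape.
    "X": tf.io.FixedLenFeature([2, 2], tf.float32),
    # "W" is missing from the record, so default_value is returned instead.
    "W": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
}
parsed = tf.io.parse_single_example(serialized, features)

print(parsed["X"].numpy())
print(parsed["W"].numpy())
```

Without default_value, parsing a record that lacks a declared fixed-length key raises an error, so the default is worth setting for optional features.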

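(2) Variable-length feature parsing: tf.io.VarLenFeature(dtype)

The general form is:

features = {
            'key1': tf.io.VarLenFeature(dtype=tf.float32),
            'key2': tf.io.VarLenFeature(dtype=tf.float32)
}

The signature is:

tf.io.VarLenFeature(dtype)

It has a single parameter, the data type, which again must be one of tf.float32, tf.int64 or tf.string.

Because it stores variable-length data, there is no shape to specify.

Important: since no shape is given for a variable-length feature, the parsed tensor is a SparseTensor.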
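A minimal sketch of what the SparseTensor looks like in practice, using two in-memory records of different lengths. The make_example helper and the values are illustrative only.

```python
import tensorflow as tf


def make_example(values):
    """Hypothetical helper: store a variable number of floats under the key 'X'."""
    return tf.train.Example(features=tf.train.Features(feature={
        "X": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
    }))


# Two records with 1 and 3 values respectively.
serialized = [
    make_example([1.0]).SerializeToString(),
    make_example([2.0, 3.0, 4.0]).SerializeToString(),
]

parsed = tf.io.parse_example(serialized, {"X": tf.io.VarLenFeature(tf.float32)})
sparse_x = parsed["X"]                 # a tf.SparseTensor
dense_x = tf.sparse.to_dense(sparse_x) # missing positions are filled with 0

print(type(sparse_x).__name__)
print(dense_x.numpy())
```

The dense result is padded to the longest record in the batch, so zeros in it may be padding rather than data.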

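3.3 Post-processing the parsed result

As described above, every sample parsed with tf.io.parse_xxx() is still a dictionary:

example = tf.io.parse_example(serialized_example, features=features)

That is, example is a dictionary in which each key is a feature name and each value is the parsed value of that feature.

The values need further conversion in the following two cases; otherwise they can be used directly.

(1) The data was converted to a string with .tostring()

String type: decode with tf.decode_raw(parsed_feature, type) in TensorFlow 1.x, or tf.io.decode_raw(parsed_feature, type) in TensorFlow 2.x. Note: type must match the dtype the data had before .tostring() was applied. If the tensor was tf.uint8 before conversion, use tf.uint8 here; if it was tf.float32, use tf.float32.

(2) The feature was parsed as variable-length, giving a sparse tensor
VarLen parsing: the result is a SparseTensor, so convert it to a dense tensor where needed with

tf.sparse_tensor_to_dense(sparse_tensor) in TensorFlow 1.x

tf.sparse.to_dense(sparse_tensor) in TensorFlow 2.x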
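For case (1), a round trip through raw bytes might look like the sketch below. It assumes the array was stored as a single bytes feature under the hypothetical key "raw"; tobytes() is used because tostring() is a deprecated alias of it in NumPy.

```python
import numpy as np
import tensorflow as tf

# A small float32 array serialized to raw bytes before being stored.
original = np.arange(6, dtype=np.float32).reshape(2, 3)
example = tf.train.Example(features=tf.train.Features(feature={
    "raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[original.tobytes()])),
}))

# A bytes feature holding one string is parsed as a scalar tf.string, i.e. shape ().
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"raw": tf.io.FixedLenFeature([], tf.string)},
)

# decode_raw must use the dtype the array had before it was turned into bytes.
flat = tf.io.decode_raw(parsed["raw"], tf.float32)
# The raw bytes carry no shape information, so the shape is restored by hand.
restored = tf.reshape(flat, [2, 3])

print(restored.numpy())
```

Because the byte string does not record the original shape, it is common to store the shape in a separate feature or to fix it by convention.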

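3.4 What to base the parsing dictionary on

(1) Basis one

The storage rules were already fixed when the tfrecord file was written: the name of each key and the shape of each value. The parsing dictionary just has to follow those same rules.

(2) Basis two

If you did not save the data yourself and do not know what the tfrecord file contains, use the method from section 1: take out one sample, look at its summary, and write the parsing dictionary from that information.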
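Basis two can be partly automated. The sketch below summarizes one serialized record as key, list type and value count, which is the information needed to choose between FixedLenFeature and VarLenFeature and to pick a dtype. The helper names are hypothetical, and the record is built in memory rather than read from a file.

```python
import tensorflow as tf


def make_example(x_values, y_value):
    """Hypothetical helper: wrap one row in a tf.train.Example."""
    feature = {
        "X": tf.train.Feature(float_list=tf.train.FloatList(value=x_values)),
        "Y": tf.train.Feature(int64_list=tf.train.Int64List(value=[y_value])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def summarize(serialized):
    """Map each feature key to (list type, number of stored values)."""
    example = tf.train.Example.FromString(serialized)
    summary = {}
    for key, feat in example.features.feature.items():
        # "kind" is the oneof field holding bytes_list, float_list or int64_list.
        kind = feat.WhichOneof("kind")
        summary[key] = (kind, len(getattr(feat, kind).value))
    return summary


serialized = make_example([1.0, 2.0, 3.0, 4.0], 1).SerializeToString()
summary = summarize(serialized)
print(summary)
```

Checking a few records rather than just one helps to tell a fixed-length feature from a variable-length one, since a single record cannot show whether the count varies.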

4. Parsing SequenceExample data, using sequence_example.tfrecord from the previous article

4.1 Parsing one sample at a time

Straight to the code:

def parse_tfrecords(serialized_example):
    # Define the parsing rules; they must match the rules used when the data was saved
    context_features = {
            'Y': tf.io.FixedLenFeature([1], dtype=tf.int64)
            }
    
    sequence_features = {
            'X': tf.io.VarLenFeature(dtype=tf.float32)
            }
    
    # Parse exactly one example per call; two dictionaries are returned
    context_example, sequence_example = tf.io.parse_single_sequence_example(serialized_example,context_features = context_features,sequence_features = sequence_features)
    
    # Sequence data is variable-length, so sequence_example['X'] is a SparseTensor
    # Convert it to a dense tensor; in TF 2.x this is done as follows
    # See the notes on converting a sparse tensor to a dense tensor
    dense_feature = tf.sparse.to_dense(sequence_example['X'])
    
    return dense_feature, context_example['Y']

    

if __name__ == "__main__":
    # generate_tfrecords()
    
    tfrecords_dataset = tf.data.TFRecordDataset("sequence_example.tfrecord")
  
    tfrecords_dataset = tfrecords_dataset.map(parse_tfrecords)  # parse every sample
    
    for feature, label in tfrecords_dataset:
        print(feature, label, sep="       ")
        print("----------------------------------------")

The output is:

tf.Tensor([[1.]], shape=(1, 1), dtype=float32)       tf.Tensor([1], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[2.]
 [2.]], shape=(2, 1), dtype=float32)       tf.Tensor([2], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[3.]
 [3.]
 [3.]], shape=(3, 1), dtype=float32)       tf.Tensor([3], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[4.]
 [4.]
 [4.]
 [4.]], shape=(4, 1), dtype=float32)       tf.Tensor([4], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[5.]
 [5.]
 [5.]
 [5.]
 [5.]], shape=(5, 1), dtype=float32)       tf.Tensor([5], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor([[1.]], shape=(1, 1), dtype=float32)       tf.Tensor([1], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[2.]
 [2.]
 [3.]], shape=(3, 1), dtype=float32)       tf.Tensor([2], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[4.]
 [4.]
 [4.]
 [4.]], shape=(4, 1), dtype=float32)       tf.Tensor([3], shape=(1,), dtype=int64)
----------------------------------------
tf.Tensor(
[[6.]
 [7.]
 [8.]], shape=(3, 1), dtype=float32)       tf.Tensor([4], shape=(1,), dtype=int64)
----------------------------------------

This matches the data from the previous article exactly, so the parsing is correct.

4.2 Batch parsing, again using sequence_example.tfrecord from the previous article

The code:

def parse_tfrecords(serialized_example):
    # Define the parsing rules; they must match the rules used when the data was saved
    context_features = {
            'Y': tf.io.FixedLenFeature([1], dtype=tf.int64)
            }
    
    sequence_features = {
            'X': tf.io.VarLenFeature(dtype=tf.float32)
            }
    
    # Parse one batch of examples per call. Note that THREE dictionaries are returned; unpacking only two raises an error
    context_example, sequence_example , _ = tf.io.parse_sequence_example(serialized_example,context_features = context_features,sequence_features = sequence_features)
    
    # Sequence data is variable-length, so sequence_example['X'] is a SparseTensor
    # Convert it to a dense tensor; in TF 2.x this is done as follows
    # See the notes on converting a sparse tensor to a dense tensor
    dense_feature = tf.sparse.to_dense(sequence_example['X'])
    
    return dense_feature, context_example['Y']

    

if __name__ == "__main__":
    # generate_tfrecords()
        
    tfrecords_dataset = tf.data.TFRecordDataset("sequence_example.tfrecord")
    tfrecords_dataset = tfrecords_dataset.repeat(3).batch(3)  # read in batches of 3
    
    tfrecords_dataset = tfrecords_dataset.map(parse_tfrecords)  # parse one batch per call
    
    for feature, label in tfrecords_dataset:
        print(feature,label, sep="       ")
        
        print("---------------------------------------------------------")

The output is:

tf.Tensor(
[[[1.]
  [0.]
  [0.]]

 [[2.]
  [2.]
  [0.]]

 [[3.]
  [3.]
  [3.]]], shape=(3, 3, 1), dtype=float32)       tf.Tensor(
[[1]
 [2]
 [3]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[4.]
  [4.]
  [4.]
  [4.]
  [0.]]

 [[5.]
  [5.]
  [5.]
  [5.]
  [5.]]

 [[1.]
  [0.]
  [0.]
  [0.]
  [0.]]], shape=(3, 5, 1), dtype=float32)       tf.Tensor(
[[4]
 [5]
 [1]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[2.]
  [2.]
  [3.]
  [0.]]

 [[4.]
  [4.]
  [4.]
  [4.]]

 [[6.]
  [7.]
  [8.]
  [0.]]], shape=(3, 4, 1), dtype=float32)       tf.Tensor(
[[2]
 [3]
 [4]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[1.]
  [0.]
  [0.]]

 [[2.]
  [2.]
  [0.]]

 [[3.]
  [3.]
  [3.]]], shape=(3, 3, 1), dtype=float32)       tf.Tensor(
[[1]
 [2]
 [3]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[4.]
  [4.]
  [4.]
  [4.]
  [0.]]

 [[5.]
  [5.]
  [5.]
  [5.]
  [5.]]

 [[1.]
  [0.]
  [0.]
  [0.]
  [0.]]], shape=(3, 5, 1), dtype=float32)       tf.Tensor(
[[4]
 [5]
 [1]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[2.]
  [2.]
  [3.]
  [0.]]

 [[4.]
  [4.]
  [4.]
  [4.]]

 [[6.]
  [7.]
  [8.]
  [0.]]], shape=(3, 4, 1), dtype=float32)       tf.Tensor(
[[2]
 [3]
 [4]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[1.]
  [0.]
  [0.]]

 [[2.]
  [2.]
  [0.]]

 [[3.]
  [3.]
  [3.]]], shape=(3, 3, 1), dtype=float32)       tf.Tensor(
[[1]
 [2]
 [3]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[4.]
  [4.]
  [4.]
  [4.]
  [0.]]

 [[5.]
  [5.]
  [5.]
  [5.]
  [5.]]

 [[1.]
  [0.]
  [0.]
  [0.]
  [0.]]], shape=(3, 5, 1), dtype=float32)       tf.Tensor(
[[4]
 [5]
 [1]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------
tf.Tensor(
[[[2.]
  [2.]
  [3.]
  [0.]]

 [[4.]
  [4.]
  [4.]
  [4.]]

 [[6.]
  [7.]
  [8.]
  [0.]]], shape=(3, 4, 1), dtype=float32)       tf.Tensor(
[[2]
 [3]
 [4]], shape=(3, 1), dtype=int64)
--------------------------------------------------------------------------------------

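As the data above shows, the parsing result is correct: within each batch, shorter sequences are padded with zeros up to the length of the longest sequence in that batch.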