Notes on the course "TensorFlow 2.0 from Beginner to Advanced, Taught by a Google Instructor" (dataset)

一杯敬朝阳一杯敬月光

Published 2021-10-08 00:26:31

Category: TensorFlow. Tags: tensorflow, deep learning, neural networks

Article link: https://blog.csdn.net/qq_xuanshuang/article/details/120642719

Versions: numpy 1.16.6, tensorflow 2.2.0, tensorflow.keras 2.3.0-tf

1. Introduction

Basic Dataset usage
- tf.data.Dataset.from_tensor_slices # build a dataset
- repeat, batch, interleave, map, shuffle, list_files, ...
csv
- tf.data.TextLineDataset # read text files
- tf.io.decode_csv # parse csv
tfrecord
- tf.train.FloatList, tf.train.Int64List, tf.train.BytesList
- tf.train.Feature, tf.train.Features, tf.train.Example # wrap data in a tf.train.Example to write it to a file
- example.SerializeToString # serialize
- tf.io.parse_single_example # parse a single tf.train.Example
- tf.io.VarLenFeature, tf.io.FixedLenFeature
- tf.data.TFRecordDataset, tf.io.TFRecordOptions

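2. Basic API usage

2.1 Building a dataset from in-memory data

A dataset can be built from data held in memory. The argument can be a plain list, a numpy array, a tuple, or a dict. A tuple has the form (x, y) and a dict has the form {key1: x, key2: y}; in both cases x and y must have the same size along the first dimension.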

import numpy as np
import tensorflow as tf

# Plain list
# TensorSliceDataset shapes: (), types: tf.int32
dataset = tf.data.Dataset.from_tensor_slices(list(range(10)))
# numpy array
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))

x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
# Tuple
# TensorSliceDataset shapes: ((2,), ()), types: (tf.int64, tf.string)
dataset = tf.data.Dataset.from_tensor_slices((x, y))
# Dict
# TensorSliceDataset
# shapes: {feature: (2,), label: ()},
# types: {feature: tf.int64, label: tf.string}
# (tf.int64 where numpy defaults to int64; a plain Python list would give tf.int32)
dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
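
The requirement that x and y share their first dimension can be checked directly. The following is a minimal sketch, not taken from the original notes: the names `features`, `labels`, and `mismatch_rejected` are made up for illustration, and it assumes that TensorFlow 2.x raises an error at construction time when the leading dimensions differ.

```python
import numpy as np
import tensorflow as tf

# Illustrative data: 3 samples with 2 features each, and 3 labels.
features = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)
labels = np.array(['cat', 'dog', 'fox'])

# Slicing happens along the first dimension, so this yields 3 (feature, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
num_elements = sum(1 for _ in dataset)
print(num_elements)
print(dataset.element_spec)

# Dropping one label leaves first dimensions of 3 and 2, which cannot be paired up.
try:
    tf.data.Dataset.from_tensor_slices((features, labels[:2]))
    mismatch_rejected = False
except (ValueError, tf.errors.InvalidArgumentError):
    mismatch_rejected = True
print(mismatch_rejected)
```

`element_spec` is a convenient way to inspect the per-element shapes and dtypes without iterating over the dataset.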

2.2 Iterating over the data

List or numpy array

dataset = tf.data.Dataset.from_tensor_slices(list(range(10)))
for item in dataset:
    print(item)
    print(item.shape, type(item))
    print(item.numpy())
    print()

Here the type of dataset is class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset', and the type of each element is class 'tensorflow.python.framework.ops.EagerTensor'.

Output, of the form:

tf.Tensor(0, shape=(), dtype=int32)
() <class 'tensorflow.python.framework.ops.EagerTensor'>
0

tf.Tensor(1, shape=(), dtype=int32)
() <class 'tensorflow.python.framework.ops.EagerTensor'>
1
.
.
.
tf.Tensor(9, shape=(), dtype=int32)
() <class 'tensorflow.python.framework.ops.EagerTensor'>
9
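
Because every element is an EagerTensor, it can be converted back to plain Python or numpy values while iterating. The following is a minimal sketch that assumes TensorFlow 2.1 or later, where `as_numpy_iterator` is available; the variable names are illustrative only.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(list(range(10)))

# Convert each EagerTensor explicitly with .numpy().
values = [int(item.numpy()) for item in dataset]

# as_numpy_iterator() performs the same conversion for every element.
values_np = [int(v) for v in dataset.as_numpy_iterator()]

print(values)
print(values == values_np)
```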

Tuple

x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset = tf.data.Dataset.from_tensor_slices((x, y))
for item in dataset:
    print(item)
    print("=" * 40)
    break
    
for item_x, item_y in dataset:
    print(item_x)
    print(item_y)
    print("=" * 20)

Output:

(<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>, <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>)
========================================
tf.Tensor([1 2], shape=(2,), dtype=int64)
tf.Tensor(b'cat', shape=(), dtype=string)
====================
tf.Tensor([3 4], shape=(2,), dtype=int64)
tf.Tensor(b'dog', shape=(), dtype=string)
====================
tf.Tensor([5 6], shape=(2,), dtype=int64)
tf.Tensor(b'fox', shape=(), dtype=string)
====================

Dict

x = [[1, 2], [3, 4], [5, 6]]
y = ['cat', 'dog', 'fox']
dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
for item in dataset:
    print(item)
    print(item['feature'])
    print(item['label'])
    print("=" * 20)

Output:

{'feature': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, 'label': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>}
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
====================
{'feature': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 4], dtype=int32)>, 'label': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>}
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
====================
{'feature': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([5, 6], dtype=int32)>, 'label': <tf.Tensor: shape=(), dtype=string, numpy=b'fox'>}
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
====================

2.2 repeat

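Repeat

Iteration: repeat repeats the original data the specified number of times. When iterating, each element has the same type as before repeat was applied; only the number of elements grows, to the specified multiple of the original count.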

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset = dataset.repeat(3)
print(dataset)
print(type(dataset))

Output:

<RepeatDataset shapes: (), types: tf.int64>
<class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>

dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
dataset = dataset.repeat(2)
print(dataset)
print(type(dataset))
for item in dataset:
    print(item['feature'])
    print(item['label'])
    print("=" * 40)

Output:

<RepeatDataset shapes: {feature: (2,), label: ()}, types: {feature: tf.int32, label: tf.string}>
<class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'> 
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
========================================
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
========================================
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
========================================
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
========================================
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
========================================
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
========================================
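
The element count after repeat can be verified with a small experiment. This is a sketch rather than part of the original notes: the values in `base` are arbitrary, and the unbounded `repeat()` call and the `tf.data.experimental.cardinality` helper go beyond what the text above covers.

```python
import tensorflow as tf

base = tf.data.Dataset.from_tensor_slices([10, 20, 30])

# repeat(2) walks through the data twice, in the original order.
twice = [int(v) for v in base.repeat(2).as_numpy_iterator()]
print(twice)

# repeat() with no count never ends, so take() is needed to bound the loop.
endless = base.repeat()
first_seven = [int(v) for v in endless.take(7).as_numpy_iterator()]
print(first_seven)

# The cardinality helper reports the infinite case with a sentinel value.
cardinality = int(tf.data.experimental.cardinality(endless).numpy())
is_infinite = cardinality == tf.data.experimental.INFINITE_CARDINALITY
print(is_infinite)
```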

2.3 batch

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset = dataset.repeat(3).batch(7, drop_remainder=True)
print(dataset)
print(type(dataset))
for item in dataset:
    print(item)
    print(item.shape, type(item))
    print(item.numpy())
    print()

Output:

<BatchDataset shapes: (7,), types: tf.int64>
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[0 1 2 3 4 5 6]

tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[7 8 9 0 1 2 3]

tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[4 5 6 7 8 9 0]

tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[1 2 3 4 5 6 7]

dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
dataset = dataset.repeat(2).batch(5)
print(dataset)
print(type(dataset))
for item in dataset:
    print(item['feature'])
    print(item['label'])
    print(type(item), type(item['feature']), type(item['label']))
    print("=" * 40)

Output:

<BatchDataset shapes: {feature: (None, 2), label: (None,)}, types: {feature: tf.int32, label: tf.string}>
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
tf.Tensor(
[[1 2]
 [3 4]
 [5 6]
 [1 2]
 [3 4]], shape=(5, 2), dtype=int32)
tf.Tensor([b'cat' b'dog' b'fox' b'cat' b'dog'], shape=(5,), dtype=string)
<class 'dict'> <class 'tensorflow.python.framework.ops.EagerTensor'> <class 'tensorflow.python.framework.ops.EagerTensor'>
========================================
tf.Tensor([[5 6]], shape=(1, 2), dtype=int32)
tf.Tensor([b'fox'], shape=(1,), dtype=string)
<class 'dict'> <class 'tensorflow.python.framework.ops.EagerTensor'> <class 'tensorflow.python.framework.ops.EagerTensor'>
========================================
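
The two examples above differ in how the leftover elements are handled. The sketch below puts both behaviors side by side on the same 30-element dataset; the names `kept` and `dropped` are illustrative and do not come from the original notes.

```python
import numpy as np
import tensorflow as tf

# 10 elements repeated 3 times gives 30 elements.
data = tf.data.Dataset.from_tensor_slices(np.arange(10)).repeat(3)

kept = data.batch(7)                          # the last batch holds the 2 leftover elements
dropped = data.batch(7, drop_remainder=True)  # the leftover elements are discarded

kept_sizes = [int(b.shape[0]) for b in kept]
dropped_sizes = [int(b.shape[0]) for b in dropped]
print(kept_sizes)
print(dropped_sizes)

# Static shapes: None when the last batch may be short, 7 when it cannot be.
print(kept.element_spec.shape, dropped.element_spec.shape)
```

This also explains why the dict example reports shapes such as (None, 2): without drop_remainder=True, the batch dimension is not known statically.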

2.4 interleave

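interleave: applies a transformation to every element of an existing dataset. Each element yields a new result (itself a dataset), and interleave merges these results into a single new dataset.
case: suppose the existing dataset holds a series of filenames. interleave applies a transformation that visits every element of the filename dataset (that is, every filename) and reads the contents of the corresponding file, so each filename produces a new dataset; interleave then merges these into one large dataset.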
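
The filename case can be sketched as follows. The file names `a.txt` and `b.txt` and their contents are hypothetical and are created in a temporary directory only for this demo; `cycle_length=2` and `block_length=1` are chosen here so that the two files are read alternately, one line at a time.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical file names and contents, created only for this demo.
file_contents = {"a.txt": ["a1", "a2"], "b.txt": ["b1", "b2"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for name, file_lines in file_contents.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "w") as f:
            f.write("\n".join(file_lines) + "\n")
        paths.append(path)

    # A dataset of filenames; each filename is mapped to a dataset of its lines.
    filenames = tf.data.Dataset.from_tensor_slices(paths)
    lines = filenames.interleave(
        lambda path: tf.data.TextLineDataset(path),
        cycle_length=2,   # read from 2 files at a time
        block_length=1,   # take 1 line from a file before moving to the next
    )
    # Iterate while the temporary files still exist.
    merged = [line.numpy().decode("utf-8") for line in lines]

print(merged)
```

With these settings the lines of the two files come out alternately rather than file by file.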

Key parameters: