Using tf.data.Dataset Objects in TensorFlow 2 (A Summary of Common Functions)

鹏阿鹏

Last modified: 2022-07-06 17:15:05

First published: 2022-04-09 22:27:48

Article link: https://blog.csdn.net/AwesomeP/article/details/124069563

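tf.data.Dataset is a high-level interface for building Dataset objects. It is very helpful when working with large datasets, and it is also the officially recommended way to handle input data.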

1 Imports

import tensorflow as tf
import numpy as np

2 Creating a Dataset

Creating from a list

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7])
print(dataset)
"""Output:
<TensorSliceDataset shapes: (), types: tf.int32>
"""

Iterating over the data

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7])
for ele in dataset:
    print(ele,"   ",ele.numpy())
"""
tf.Tensor(1, shape=(), dtype=int32)     1
tf.Tensor(2, shape=(), dtype=int32)     2
tf.Tensor(3, shape=(), dtype=int32)     3
tf.Tensor(4, shape=(), dtype=int32)     4
tf.Tensor(5, shape=(), dtype=int32)     5
tf.Tensor(6, shape=(), dtype=int32)     6
tf.Tensor(7, shape=(), dtype=int32)     7
 """

Creating from a multi-dimensional list

dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4], [5, 6]])
for ele in dataset:
    print(ele.numpy())
"""
[1 2]
[3 4]
[5 6]
"""

Creating from a dictionary

dataset_dic = tf.data.Dataset.from_tensor_slices({'a': [1, 2, 3, 4],
                                                  'b': [6, 7, 8, 9],
                                                  'c': [12, 13, 14, 15]})
for ele in dataset_dic:
    print(ele)

{'a': <tf.Tensor: id=60, shape=(), dtype=int32, numpy=1>, 'b': <tf.Tensor: id=61, shape=(), dtype=int32, numpy=6>, 'c': <tf.Tensor: id=62, shape=(), dtype=int32, numpy=12>}
{'a': <tf.Tensor: id=66, shape=(), dtype=int32, numpy=2>, 'b': <tf.Tensor: id=67, shape=(), dtype=int32, numpy=7>, 'c': <tf.Tensor: id=68, shape=(), dtype=int32, numpy=13>}
{'a': <tf.Tensor: id=72, shape=(), dtype=int32, numpy=3>, 'b': <tf.Tensor: id=73, shape=(), dtype=int32, numpy=8>, 'c': <tf.Tensor: id=74, shape=(), dtype=int32, numpy=14>}
{'a': <tf.Tensor: id=78, shape=(), dtype=int32, numpy=4>, 'b': <tf.Tensor: id=79, shape=(), dtype=int32, numpy=9>, 'c': <tf.Tensor: id=80, shape=(), dtype=int32, numpy=15>}

Creating from a NumPy array

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7,100]))
# Take the first four elements
for ele in dataset.take(4):
    print(ele.numpy())

Taking the first element

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7,100]))
print(next(iter(dataset.take(1))))

tf.Tensor(1, shape=(), dtype=int64)

Creating from a tuple. This is the approach used most often in practice, for example building a Dataset from (feature, label)

feature = np.array([[1,2],[3,4],[5,6]])
print("feature shape:",feature.shape)
label = np.array(['pig','dog','cat'])
print("label shape:",label.shape)
mydataset = tf.data.Dataset.from_tensor_slices((feature,label))
for element_numpy in mydataset.as_numpy_iterator(): # note: as_numpy_iterator is not available in version 2.0; upgrade TensorFlow to use it
    print(element_numpy)

feature shape: (3, 2)
label shape: (3,)
(array([1, 2]), b'pig')
(array([3, 4]), b'dog')
(array([5, 6]), b'cat')
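
All of the creation examples above rely on from_tensor_slices cutting its input along the first axis, so each row becomes one element. As a rough illustration of what that means, the sketch below compares it with tf.data.Dataset.from_tensors, which is not covered in this article and keeps the whole input as a single element. The variable names are made up for the example.

```python
import tensorflow as tf

rows = [[1, 2], [3, 4], [5, 6]]

# from_tensor_slices slices along axis 0: three elements, each of shape [2]
sliced = tf.data.Dataset.from_tensor_slices(rows)
sliced_shapes = [ele.shape.as_list() for ele in sliced]

# from_tensors wraps the whole input: one element of shape [3, 2]
whole = tf.data.Dataset.from_tensors(rows)
whole_shapes = [ele.shape.as_list() for ele in whole]

print(sliced_shapes)  # [[2], [2], [2]]
print(whole_shapes)   # [[3, 2]]
```

This is also why the feature and label arrays passed as a tuple must have the same length along the first axis: they are sliced in lockstep.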

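3 Shuffling the data

shuffle(buffer_size,seed=None,reshuffle_each_iteration=None)

buffer_size: number of elements held in the shuffle buffer, from which the next element is drawn at random (the larger the buffer, the more thorough the shuffle)
seed: random seed
reshuffle_each_iteration: whether the data is reshuffled every time the dataset is iterated

In general, shuffle(buffer_size) is used to randomize the order of the data, so that samples do not show up in the same fixed order in every training epoch. buffer_size specifies the size of the buffer and is usually set to a fairly large number; a buffer at least as large as the dataset gives a full shuffle.

# Shuffle randomly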
dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.shuffle(10)
for ele in dataset:
    print(ele)

tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
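
To make the effect of the three shuffle parameters more concrete, here is a small sketch. It assumes TensorFlow 2.1 or later (for as_numpy_iterator); the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

values = np.arange(1, 8)  # 1..7
base = tf.data.Dataset.from_tensor_slices(values)

# With buffer_size=1 the buffer only ever holds one element,
# so the original order comes out unchanged.
unshuffled = [int(x) for x in base.shuffle(1).as_numpy_iterator()]

# A buffer covering the whole dataset shuffles everything.
# reshuffle_each_iteration=False keeps the same order on every pass.
fixed = base.shuffle(len(values), seed=42, reshuffle_each_iteration=False)
first_pass = [int(x) for x in fixed.as_numpy_iterator()]
second_pass = [int(x) for x in fixed.as_numpy_iterator()]

print(unshuffled)                 # [1, 2, 3, 4, 5, 6, 7]
print(first_pass == second_pass)  # True
```

For training, the default behaviour (a new order on each pass) is usually what you want; a fixed order is mostly useful for debugging.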

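4 Setting the batch size

batch(batch_size,drop_remainder=False) combines consecutive elements of the dataset into batches

batch_size: number of elements per batch
drop_remainder: whether to drop the last batch when it has fewer than batch_size elements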

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7,8,9,10]))
dataset = dataset.batch(3)
for ele in dataset:
    print(ele)

tf.Tensor([1 2 3], shape=(3,), dtype=int64)
tf.Tensor([4 5 6], shape=(3,), dtype=int64)
tf.Tensor([7 8 9], shape=(3,), dtype=int64)
tf.Tensor([10], shape=(1,), dtype=int64)
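
The example above leaves drop_remainder at its default, so the last batch only holds the leftover element. A minimal sketch of the other case, using the same ten numbers (the variable name full_batches is just for illustration):

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(1, 11))  # 1..10

# drop_remainder=True discards the final, incomplete batch [10]
full_batches = [
    batch.tolist()
    for batch in dataset.batch(3, drop_remainder=True).as_numpy_iterator()
]
print(full_batches)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Dropping the remainder gives every batch the same static shape, which some models and accelerators require.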

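5 Repeating the data

repeat(count=None) produces a dataset that repeats the original one; count is the number of times the data is read. For example, if the original data is {1,2}, then after repeat(2) it becomes {1,2,1,2}. When count is omitted, the data is repeated indefinitely.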

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4,5])).repeat(3).batch(3)
for ele in dataset:
    print(ele)

tf.Tensor([1 2 3], shape=(3,), dtype=int64)
tf.Tensor([4 5 1], shape=(3,), dtype=int64)
tf.Tensor([2 3 4], shape=(3,), dtype=int64)
tf.Tensor([5 1 2], shape=(3,), dtype=int64)
tf.Tensor([3 4 5], shape=(3,), dtype=int64)
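
In the output above, batches straddle the boundary between repetitions (for example [4 5 1]) because repeat comes before batch. The sketch below shows the other ordering, plus an unbounded repeat() cut short with take. Variable names are illustrative only.

```python
import numpy as np
import tensorflow as tf

base = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5]))

# batch first, then repeat: every pass ends with its own short batch
epoch_aligned = [
    batch.tolist() for batch in base.batch(3).repeat(2).as_numpy_iterator()
]
print(epoch_aligned)  # [[1, 2, 3], [4, 5], [1, 2, 3], [4, 5]]

# repeat() without a count never ends, so limit it with take()
small = tf.data.Dataset.from_tensor_slices(np.array([1, 2]))
endless_head = [int(x) for x in small.repeat().take(7).as_numpy_iterator()]
print(endless_head)  # [1, 2, 1, 2, 1, 2, 1]
```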

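6 Mapping the data

map(map_func,num_parallel_calls=None) applies map_func to every element of the dataset and returns a new dataset.

map_func: the processing function
num_parallel_calls: number of elements to process in parallel

Example 1: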

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.map(tf.square)
for ele in dataset:
    print(ele.numpy())

Example 2

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.map(lambda x:x+1)
for ele in dataset:
    print(ele.numpy())

Example 3

def re_xxx(x): # define the processing function
    return x*x*x
dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.map(re_xxx)
for ele in dataset:
    print(ele.numpy())
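
The three examples above map over single values and leave num_parallel_calls unset. As a sketch of a more typical case, the code below maps over (feature, label) pairs, where map_func receives one argument per tuple component, and lets TensorFlow pick the parallelism. The function double_features and the sample data are made up for illustration.

```python
import numpy as np
import tensorflow as tf

features = np.array([[1, 2], [3, 4], [5, 6]])
labels = np.array([0, 1, 0])

def double_features(feature, label):
    # Transform the feature, pass the label through untouched
    return feature * 2, label

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.map(
    double_features,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

mapped = [(f.tolist(), int(l)) for f, l in dataset.as_numpy_iterator()]
print(mapped)  # [([2, 4], 0), ([6, 8], 1), ([10, 12], 0)]
```

By default the element order is preserved even when the mapping runs in parallel.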

7 Concatenating datasets

A.concatenate(B) appends dataset B to the end of dataset A and returns the combined dataset

dataset_A = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3])).shuffle(3)
dataset_B = tf.data.Dataset.from_tensor_slices(np.array([4, 5, 6])).shuffle(3)
dataset_AB = dataset_A.concatenate(dataset_B)
for ele in dataset_AB:
    print(ele.numpy())
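
Note that in the example above each dataset is shuffled on its own, and concatenate does not mix them: all of A still comes before all of B. A small check of that, with illustrative variable names:

```python
import numpy as np
import tensorflow as tf

dataset_A = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3])).shuffle(3)
dataset_B = tf.data.Dataset.from_tensor_slices(np.array([4, 5, 6])).shuffle(3)

combined = [int(x) for x in dataset_A.concatenate(dataset_B).as_numpy_iterator()]
first_half, second_half = combined[:3], combined[3:]

# The order inside each half is random, but the halves never mix
print(sorted(first_half), sorted(second_half))  # [1, 2, 3] [4, 5, 6]
```

To mix the two sources, apply shuffle after concatenate instead.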

8 Zipping into tuples

zip(datasets) packs several datasets into a new dataset of tuples, in the same way as Python's built-in zip function

dataset_fea = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3]))
dataset_lab = tf.data.Dataset.from_tensor_slices(np.array([4, 5, 6]))
datasets = tf.data.Dataset.zip((dataset_fea,dataset_lab))
for ele in datasets:
    print(ele)

(<tf.Tensor: shape=(), dtype=int64, numpy=1>, <tf.Tensor: shape=(), dtype=int64, numpy=4>)
(<tf.Tensor: shape=(), dtype=int64, numpy=2>, <tf.Tensor: shape=(), dtype=int64, numpy=5>)
(<tf.Tensor: shape=(), dtype=int64, numpy=3>, <tf.Tensor: shape=(), dtype=int64, numpy=6>)
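
Like the built-in zip, tf.data.Dataset.zip stops at the shortest input. A quick sketch with two datasets of different lengths, built here with tf.data.Dataset.range just to keep the example short:

```python
import tensorflow as tf

longer = tf.data.Dataset.range(5)        # 0, 1, 2, 3, 4
shorter = tf.data.Dataset.range(10, 13)  # 10, 11, 12

zipped = tf.data.Dataset.zip((longer, shorter))
pairs = [(int(a), int(b)) for a, b in zipped.as_numpy_iterator()]
print(pairs)  # [(0, 10), (1, 11), (2, 12)]
```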

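9 Padding the data

padded_batch(batch_size,padded_shapes,padding_values=None) combines elements into batches, padding each element up to padded_shapes with padding_values

batch_size: number of elements per batch
padded_shapes: shape of each sample after padding
padding_values: the value used for padding (0 by default)

Example:

data = tf.data.Dataset.from_tensor_slices([[1,2],[3,4]])
# Append zeros to each record so that its shape becomes (4,)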
data = data.padded_batch(2,padded_shapes=[4])
for ele in data:
    print(ele)

tf.Tensor(
[[1 2 0 0]
 [3 4 0 0]], shape=(2, 4), dtype=int32)
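
padded_batch is most useful when the elements have different lengths, which from_tensor_slices on a rectangular list cannot show. The sketch below builds variable-length sequences with map and tf.fill (an assumption made for the example, not something from the article) and pads each batch to its longest member by passing None in padded_shapes.

```python
import tensorflow as tf

# Elements of length 1, 2, 3, 4: [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]
sequences = tf.data.Dataset.range(1, 5).map(
    lambda n: tf.fill([tf.cast(n, tf.int32)], n)
)

# None means: pad this dimension to the longest element in each batch.
# The padding value must have the same dtype as the data (int64 here).
padded = sequences.padded_batch(
    2,
    padded_shapes=[None],
    padding_values=tf.constant(-1, dtype=tf.int64),
)

batches = [batch.tolist() for batch in padded.as_numpy_iterator()]
print(batches)
# [[[1, -1], [2, 2]], [[3, 3, 3, -1], [4, 4, 4, 4]]]
```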

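10 Filtering the data

filter(predicate) filters the elements of the dataset with the function predicate, keeping only those for which predicate returns True

data = tf.data.Dataset.from_tensor_slices([-1,2,-3,4])
data = data.filter(lambda x:tf.less(x,1)) # keep the elements smaller than 1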
for ele in data:
    print(ele)

tf.Tensor(-1, shape=(), dtype=int32)
tf.Tensor(-3, shape=(), dtype=int32)
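
The same idea carries over to (feature, label) datasets: the predicate receives one argument per tuple component and must return a scalar boolean. A minimal sketch with made-up data that keeps only the samples labelled 1:

```python
import numpy as np
import tensorflow as tf

features = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
labels = np.array([0, 1, 0, 1])

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
positives = dataset.filter(lambda feature, label: tf.equal(label, 1))

kept = [(f.tolist(), int(l)) for f, l in positives.as_numpy_iterator()]
print(kept)  # [([3, 4], 1), ([7, 8], 1)]
```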

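11 Setting the prefetch buffer

prefetch(buffer_size) sets the maximum number of elements that are prepared ahead of time while the dataset is being consumed. It is generally used as the last step of the pipeline.
The recommended value for buffer_size is tf.data.experimental.AUTOTUNE, which lets the system choose the buffer size automatically (in TensorFlow 2.4 and later the same constant is also available as tf.data.AUTOTUNE).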

autotune = tf.data.experimental.AUTOTUNE
data = tf.data.Dataset.from_tensor_slices([-1,2,-3,4]).shuffle(10).repeat(3).batch(3).prefetch(autotune)
for ele in data:
    print(ele)

tf.Tensor([ 2  4 -1], shape=(3,), dtype=int32)
tf.Tensor([-3  4  2], shape=(3,), dtype=int32)
tf.Tensor([-3 -1  4], shape=(3,), dtype=int32)
tf.Tensor([-1 -3  2], shape=(3,), dtype=int32)
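
Putting the pieces together, a typical (feature, label) input pipeline might look like the sketch below: shuffle, then batch, then prefetch as the final step. The data, batch size, and variable names are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

features = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 samples
labels = np.arange(10) % 2

autotune = tf.data.experimental.AUTOTUNE
pipeline = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=len(features))  # buffer covers the whole dataset
    .batch(4)
    .prefetch(autotune)                  # last step: overlap loading and use
)

batch_shapes = [(x.shape, y.shape) for x, y in pipeline.as_numpy_iterator()]
print(batch_shapes)
# [((4, 2), (4,)), ((4, 2), (4,)), ((2, 2), (2,))]
```

prefetch does not change the contents of the dataset; it only lets the next batch be prepared while the current one is being consumed.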
