tensorflow.dataset: dataset operations

三叶草～

Category: TensorFlow2.X. Tags: tensorflow, artificial intelligence, python

Article link: https://blog.csdn.net/weixin_67463124/article/details/123447663

Datasets are used constantly in TensorFlow. This post summarizes a few common operations.

1 Creating a dataset

Create the dataset from data in memory; this suits datasets that are not too large.

import tensorflow as tf

# Pass a list: each element is converted to a Tensor and placed into the Dataset in order
x1 = [0, 1, 2, 3, 4]
x2 = [[0, 1], [2, 3], [4, 5]]
ds1 = tf.data.Dataset.from_tensor_slices(x1)
ds2 = tf.data.Dataset.from_tensor_slices(x2)
for step, m in enumerate(ds1):
    print(m)  # tf.Tensor(0, shape=(), dtype=int32)...
for step, m in enumerate(ds2):
    print(m)  # tf.Tensor([0 1], shape=(2,), dtype=int32)...

# Pass a tuple. This form is convenient for pairing features with labels.
xx = [[0, 1], [2, 3], [4, 5]]
yy = [11, 22, 33]
ds11 = tf.data.Dataset.from_tensor_slices((xx, yy))
for step, (ds11_xx, ds11_yy) in enumerate(ds11):
    print(ds11_xx)  # tf.Tensor([0 1], shape=(2,), dtype=int32)...
    print(ds11_yy)  # tf.Tensor(11, shape=(), dtype=int32)...
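
To check what a dataset built from a tuple actually contains, it can help to inspect its `element_spec` and to pull the values out as plain Python objects. The following is a minimal sketch; the names `features`, `labels`, and `pairs` are illustrative and not from the original post.

```python
import tensorflow as tf

# Illustrative data (same shape as the tuple example above)
features = [[0, 1], [2, 3], [4, 5]]
labels = [11, 22, 33]
ds = tf.data.Dataset.from_tensor_slices((features, labels))

# element_spec describes the shape and dtype of a single (feature, label) element
print(ds.element_spec)

# as_numpy_iterator() yields NumPy values instead of tf.Tensor objects
pairs = [(x.tolist(), int(y)) for x, y in ds.as_numpy_iterator()]
print(pairs)  # [([0, 1], 11), ([2, 3], 22), ([4, 5], 33)]
```

Each feature row stays paired with the label at the same index, which is why this form works well for supervised training data.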

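For creating a dataset from a CSV file, see another post by the same author:

Processing CSV files into a dataset (三叶草～'s blog, CSDN): converting a CSV file into a TensorFlow 2.0 dataset and splitting it into training, test, and validation sets. https://blog.csdn.net/weixin_67463124/article/details/123311272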

2 Operating on a dataset

Basic dataset operations and how to iterate over a dataset.

ds1 = tf.data.Dataset.from_tensor_slices(([[1, 2, 3], [4,5,6], [7,8,9], [77,77,88]], [11, 22, 33, 44]))
ds2 = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9], [77,77,88]])

# take(count) returns a dataset containing at most the first count elements of ds2
ds_t = ds2.take(2)
for step, x in enumerate(ds_t):
    print(x)  # tf.Tensor([1 2 3], shape=(3,), dtype=int32)...

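# batch() groups elements into batches
# batch_size: number of consecutive elements of this dataset to combine into a single batch.
# drop_remainder: whether to drop the last batch if it has fewer than batch_size elements; defaults to False
ds_b = ds2.batch(2)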
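
The effect of `drop_remainder` is easiest to see when the number of elements is not a multiple of `batch_size`. A small sketch, using `tf.data.Dataset.range(5)` as stand-in data (an assumption, not data from the original post):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # elements 0, 1, 2, 3, 4

# Default: the final, smaller batch is kept
kept = [b.tolist() for b in ds.batch(2).as_numpy_iterator()]
# drop_remainder=True: the final, incomplete batch is discarded
dropped = [b.tolist() for b in ds.batch(2, drop_remainder=True).as_numpy_iterator()]

print(kept)     # [[0, 1], [2, 3], [4]]
print(dropped)  # [[0, 1], [2, 3]]
```

Dropping the remainder gives every batch the same static shape, at the cost of skipping a few elements per pass.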

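# map() applies map_func (here, preprocess) to every element of the dataset and returns a new dataset
def preprocess(x, y):
    x = tf.reshape(x, [1, 3])
    y = tf.cast(y, tf.int32)
    return x, y
ds_m = ds1.map(preprocess)  # ds1 itself is unchanged, so keep the returned dataset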
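
Dataset transformations such as `map()` return a new dataset and leave the original untouched. The sketch below illustrates this with a hypothetical `scale` function (not part of the original post) that casts the features to float32 and rescales them.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(([[1, 2, 3], [4, 5, 6]], [11, 22]))

def scale(x, y):
    # Hypothetical preprocessing step: cast features to float32 and divide by 10
    return tf.cast(x, tf.float32) / 10.0, y

mapped = ds.map(scale)

# The original dataset still holds the integer features
original_first = next(iter(ds))[0].numpy().tolist()
# The mapped dataset holds the rescaled float features
mapped_first = next(iter(mapped))[0].numpy().tolist()

print(original_first)
print(mapped_first)
```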

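# shuffle() randomly shuffles the dataset
# buffer_size: size of the shuffle buffer, which controls how thoroughly elements are mixed. A value of 1 does not shuffle at all; a value greater than or equal to the number of elements in the Dataset gives a full uniform shuffle.
# seed: random seed used to create the distribution.
# reshuffle_each_iteration: if true, the dataset is pseudorandomly reshuffled each time it is iterated over; defaults to True.
ds_s = ds1.shuffle(4)  # shuffle() returns a new dataset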
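
The two extremes of `buffer_size` can be checked directly. This is a small sketch on `tf.data.Dataset.range(10)` (stand-in data, and the seed value 42 is arbitrary): with a buffer of 1 the order is preserved, and with a buffer as large as the dataset the result is some permutation of all elements.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# buffer_size=1: each element is emitted as soon as it is read, so order is unchanged
unshuffled = [int(x) for x in ds.shuffle(buffer_size=1).as_numpy_iterator()]

# buffer_size equal to the dataset size: every element can land in any position
shuffled = [int(x) for x in ds.shuffle(buffer_size=10, seed=42).as_numpy_iterator()]

print(unshuffled)
print(shuffled)
print(sorted(shuffled) == list(range(10)))  # same elements, possibly different order
```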

# repeat() count: number of times to repeat the dataset; the default None, like -1, repeats it indefinitely
ds_r = ds1.repeat(-1)  # repeat() returns a new dataset
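
A quick way to see the difference between a finite and an indefinite `repeat()` is to count the elements that come out. In this sketch (stand-in data from `tf.data.Dataset.range(3)`), `take()` is used to cut the endless dataset off, since iterating it directly would never terminate.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(3)  # elements 0, 1, 2

# repeat(2): the dataset is traversed exactly twice
twice = [int(x) for x in ds.repeat(2).as_numpy_iterator()]

# repeat() with no count repeats indefinitely; take(7) keeps only the first 7 elements
endless_head = [int(x) for x in ds.repeat().take(7).as_numpy_iterator()]

print(twice)         # [0, 1, 2, 0, 1, 2]
print(endless_head)  # [0, 1, 2, 0, 1, 2, 0]
```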

3 Combining datasets

Join existing datasets side by side (zip) or end to end (concatenate).

# Put two datasets together and iterate over them in parallel; this can also pair a feature dataset with a label dataset
ds = tf.data.Dataset.zip((ds1, ds2))
for step, ((ds1_x, ds1_y), ds2_x) in enumerate(ds):
    print(step, ds1_x, ds1_y, ds2_x)

# Concatenate two datasets end to end (both must have the same element structure)
ds = ds1.concatenate(ds1)  # 4 elements => 8 elements
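
The difference between the two kinds of joining is easier to see on small integer datasets. A minimal sketch, where `a` and `b` are illustrative stand-ins rather than the datasets used above:

```python
import tensorflow as tf

a = tf.data.Dataset.range(3)       # 0, 1, 2
b = tf.data.Dataset.range(10, 13)  # 10, 11, 12

# zip: elements are paired position by position, so the length stays 3
zipped = [(int(x), int(y)) for x, y in tf.data.Dataset.zip((a, b)).as_numpy_iterator()]

# concatenate: b's elements follow a's, so the length becomes 6
joined = [int(x) for x in a.concatenate(b).as_numpy_iterator()]

print(zipped)  # [(0, 10), (1, 11), (2, 12)]
print(joined)  # [0, 1, 2, 10, 11, 12]
```

Note that `zip` stops as soon as the shortest input dataset is exhausted.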