tensorflow2.0-Dataset

jack_wine

Article link: https://blog.csdn.net/weixin_41485334/article/details/104411095

1. from_tensor_slices()

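from_tensor_slices() converts data into an iterable dataset (TensorSliceDataset) by slicing it along the first dimension.
Accepted input types: lists, tuples, and tensors (NumPy arrays work as well, as the examples further down show).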

import numpy as np
import tensorflow as tf

# Lists, tuples, tensors, and NumPy arrays can be passed in
# List
res = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for i in res:
    print(i)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)

# Tuple
res = tf.data.Dataset.from_tensor_slices(([1,2,3],['a','b','c']))
for i in res:
    print(i)
#(<tf.Tensor: id=9, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=10, shape=(), dtype=string, numpy=b'a'>)
#(<tf.Tensor: id=11, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=12, shape=(), dtype=string, numpy=b'b'>)
#(<tf.Tensor: id=13, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=14, shape=(), dtype=string, numpy=b'c'>)

# Tensor: a single tensor
a = tf.constant([1,2,3])
res = tf.data.Dataset.from_tensor_slices(a)
for i in res:
    print(i)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)

# Tensors: a tuple of tensors
a = tf.constant([1,2,3])
b = tf.constant(['a','b','c'])
res = tf.data.Dataset.from_tensor_slices((a,b))
for i in res:
    print(i)
#(<tf.Tensor: id=9, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=10, shape=(), dtype=string, numpy=b'a'>)
#(<tf.Tensor: id=11, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=12, shape=(), dtype=string, numpy=b'b'>)
#(<tf.Tensor: id=13, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=14, shape=(), dtype=string, numpy=b'c'>)

# Tensors: a list of tensors
# The tensors must have the same shape, because a list is stacked into one tensor before slicing
a = tf.constant([1,2,3])
b = tf.constant([11,22,33])
res = tf.data.Dataset.from_tensor_slices([a,b])
for i in res:
    print(i)
# tf.Tensor([1 2 3], shape=(3,), dtype=int32)
# tf.Tensor([11 22 33], shape=(3,), dtype=int32)

# High-dimensional NumPy array
# Sliced along the first dimension
arr = np.random.sample((6,3,2))
print(arr.shape)
# (6, 3, 2)
res = tf.data.Dataset.from_tensor_slices(arr)
for i in res:
    print(i)
#  tf.Tensor(
# [[0.25434336 0.02573094]
#  [0.87407707 0.34301813]
#  [0.57893279 0.22542748]], shape=(3, 2), dtype=float64)
#  ...

# High-dimensional tensor
arr = np.random.sample((6,3,2))
tensor = tf.constant(arr)
res = tf.data.Dataset.from_tensor_slices(tensor)
for i in res:
    print(i)
# Same result as above

2. Methods of the TensorSliceDataset class

2.1 batch()

Splits the original dataset into batches: consecutive elements are stacked, so each batch gains a new leading dimension.

arr = np.random.sample((6,3,2))
res = tf.data.Dataset.from_tensor_slices(arr).batch(2)
for i in res:
    print(i.shape)
# (2, 3, 2), printed three times
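When the number of elements is not a multiple of the batch size, the last batch is smaller. The sketch below illustrates this with a toy dataset built from tf.data.Dataset.range (chosen here only for illustration) and shows the drop_remainder argument of batch(), which discards the incomplete final batch.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(7)

# 7 elements in batches of 3: the last batch holds the single leftover element.
sizes = [int(b.shape[0]) for b in ds.batch(3)]
print(sizes)  # [3, 3, 1]

# drop_remainder=True discards the incomplete final batch.
full_sizes = [int(b.shape[0]) for b in ds.batch(3, drop_remainder=True)]
print(full_sizes)  # [3, 3]
```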

2.2 repeat()

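Repeats the data in the dataset. The argument is the number of repetitions; if it is omitted, the dataset repeats indefinitely, so iteration only stops when the consumer stops (for example via take() or a fixed number of training steps).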
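A minimal sketch of both forms of repeat(), using a three-element toy dataset. The endless variant is bounded with take() so that the loop terminates.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# repeat(2) yields the whole dataset twice, in order.
twice = [int(x.numpy()) for x in ds.repeat(2)]
print(twice)  # [1, 2, 3, 1, 2, 3]

# repeat() without an argument never ends, so bound it with take().
endless_head = [int(x.numpy()) for x in ds.repeat().take(7)]
print(endless_head)  # [1, 2, 3, 1, 2, 3, 1]
```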

2.3 map()

Applies the function passed to map to every element (Tensor) of the TensorSliceDataset.

arr = np.random.sample((6,3,2))
res = tf.data.Dataset.from_tensor_slices(arr).map(lambda x:x*10)
for i in res:
    print(i)
#tf.Tensor(
#[[1.26557371 7.35635345]
# [1.93987519 8.77390075]
# [5.93275473 4.93974022]], shape=(3, 2), dtype=float64)
# ...

2.4 take()

Takes the first few elements of the dataset and forms a new dataset from them.
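A short sketch of take() on a toy dataset of the integers 0 to 9 (the dataset is an assumption made for illustration):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# take(3) keeps only the first three elements.
first_three = [int(x.numpy()) for x in ds.take(3)]
print(first_three)  # [0, 1, 2]

# Asking for more elements than exist simply returns the whole dataset.
all_items = [int(x.numpy()) for x in ds.take(100)]
print(len(all_items))  # 10
```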

2.5 skip()

Skips the first few elements of the dataset; the remaining elements form a new dataset.
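skip() is often paired with take() to split one dataset into two parts. The sketch below does this on a toy dataset; the 8/2 split and the names train_part and val_part are illustrative choices, not something the original article prescribes.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# skip(7) drops the first seven elements and keeps the rest.
rest = [int(x.numpy()) for x in ds.skip(7)]
print(rest)  # [7, 8, 9]

# Hypothetical split: first 2 elements for validation, the remainder for training.
val_part = [int(x.numpy()) for x in ds.take(2)]
train_part = [int(x.numpy()) for x in ds.skip(2)]
print(val_part, len(train_part))  # [0, 1] 8
```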

2.6 prefetch()

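To speed up data reading, prefetch() loads upcoming elements into memory ahead of time while the current one is being consumed. Its argument counts dataset elements, so after batch() it counts batches: with a batch size of 16, prefetch(2) keeps up to 32 samples ready. This function is generally used after batch().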
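A minimal sketch of the usual ordering, batch() followed by prefetch(). The toy dataset of 64 integers is an assumption; prefetching changes only when elements are prepared, not what the dataset yields, so the batch shapes are the same as without it.

```python
import tensorflow as tf

# 64 samples -> 4 batches of 16; prefetch(2) keeps up to 2 batches ready in advance.
ds = tf.data.Dataset.range(64).batch(16).prefetch(2)

shapes = [tuple(b.shape) for b in ds]
print(shapes)  # [(16,), (16,), (16,), (16,)]
```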

3. CsvDataset

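Reads CSV files. The main parameters are filenames, record_defaults, header, and select_cols: the path(s) of the file(s) to read, the data type (or default value) of each column, whether the first line is a header to be skipped, and which columns to read.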

train_data = tf.data.experimental.CsvDataset('california_housing.csv', [tf.float32]*9, header=True)
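The parameters can be tried out on a small file. The sketch below writes a hypothetical three-column CSV to a temporary directory (the file name demo.csv and its contents are made up for the demonstration) and reads only two of its columns.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical demo file: a header line followed by two data rows.
csv_text = "a,b,c\n1.0,2.0,3.0\n4.0,5.0,6.0\n"
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w") as f:
    f.write(csv_text)

ds = tf.data.experimental.CsvDataset(
    path,
    record_defaults=[tf.float32, tf.float32],  # one entry per selected column
    header=True,                               # skip the first (header) line
    select_cols=[0, 2],                        # read only columns a and c
)

# Each element is a tuple with one scalar tensor per selected column.
rows = [tuple(float(v.numpy()) for v in row) for row in ds]
print(rows)  # [(1.0, 3.0), (4.0, 6.0)]
```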

4. Reading and writing TFRecords files

4.1 Writing TFRecords files

Data written to a TFRecords file is organized in the following levels:
①tf.train.Example --> ②tf.train.Features --> ③tf.train.Feature --> ④tf.train.BytesList / tf.train.FloatList / tf.train.Int64List

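④ is the lowest level, tf.train.XXXList, which can hold three kinds of values: bytes, floats, and integers. Values are passed in as a list.
③ wraps one such list as a feature.
② organizes the features (a dictionary keyed by feature name).
① produces the sample data.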

④: tf.train.XXXList(value=[...])
③: tf.train.Feature(XXX_list=tf.train.XXXList)
②: tf.train.Features(feature={'key': tf.train.Feature})
①: tf.train.Example(features=tf.train.Features).SerializeToString()
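The four levels can be followed bottom-up in a small sketch. The feature names id, scores, and label and their values are hypothetical. The serialized bytes are parsed back with the protobuf method FromString to confirm that the round trip preserves the values.

```python
import tensorflow as tf

# Level 4: typed value lists (values are always passed as a list).
int_list = tf.train.Int64List(value=[7])
float_list = tf.train.FloatList(value=[0.5, 1.5])
bytes_list = tf.train.BytesList(value=[b"cat"])

# Levels 3 and 2: wrap each list in a Feature, then group them by name.
features = tf.train.Features(feature={
    "id": tf.train.Feature(int64_list=int_list),
    "scores": tf.train.Feature(float_list=float_list),
    "label": tf.train.Feature(bytes_list=bytes_list),
})

# Level 1: one Example per sample, serialized to bytes for writing.
serialized = tf.train.Example(features=features).SerializeToString()

# Round trip: parse the bytes back into an Example and read the values.
restored = tf.train.Example.FromString(serialized)
restored_id = restored.features.feature["id"].int64_list.value[0]
restored_scores = list(restored.features.feature["scores"].float_list.value)
restored_label = restored.features.feature["label"].bytes_list.value[0]
print(restored_id, restored_scores, restored_label)
```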

Writing a TFRecords file is shown for two cases:

4.1.1 CSV data --> TFRecords

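First, read the CSV file with pandas. Because TFRecords only support int64, float, and bytes data, all string values must be encoded as bytes.
Next, convert the feature lists produced in the previous step into tf_feature objects.
Finally, build the Example and write it out.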

import tensorflow as tf
import numpy as np
import pandas as pd

def read_csv_to_list(filenames):
    data = pd.read_csv(filenames)
    features = []
    feature_names = data.columns
    for column in feature_names:
        if isinstance(data[column].values[0],str):
            # Convert str values to bytes
            feature = list(map(lambda x: bytes(x, encoding='UTF8'), data[column].values))
        else:
            feature = data[column].values
        features.append(feature)
    return feature_names, features

def get_feature_list(features):
    tf_feature = []
    for feature in features:
        if isinstance(feature[0],np.int64):
            tf_feature.append(
                tf.train.Feature(int64_list=tf.train.Int64List(value=feature))
            )
        elif isinstance(feature[0],np.float64):
            tf_feature.append(
                tf.train.Feature(float_list=tf.train.FloatList(value=feature))
            )
        else:
            tf_feature.append(
                tf.train.Feature(bytes_list=tf.train.BytesList(value=feature))
            )

    return tf_feature

features_name, features = read_csv_to_list('eval.csv')
tf_feature = get_feature_list(features)

features = tf.train.Features(
    feature={i: j for i, j in zip(features_name, tf_feature)}
)
examples = tf.train.Example(features=features).SerializeToString()

with tf.io.TFRecordWriter('test.tfrecords') as writer:
    writer.write(examples)

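The code in this section only illustrates how TFRecords are organized and written.
For that reason all the data is stored in a single Example, but this means the data still has to be split into samples at training time.
So when writing, each sample should be stored in its own Example.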
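A minimal sketch of the per-sample approach, assuming a small hypothetical table with an integer, a float, and a string column (the column names, values, and file name are made up). Each row becomes one Example, and the file then contains as many records as there are rows.

```python
import os
import tempfile

import pandas as pd
import tensorflow as tf

# Hypothetical data standing in for a CSV file loaded with pd.read_csv.
df = pd.DataFrame({"age": [22, 35], "fare": [7.25, 71.5], "sex": ["male", "female"]})

def row_to_example(row):
    """Serialize one row (a dict of column -> value) as one tf.train.Example."""
    feature = {
        "age": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(row["age"])])),
        "fare": tf.train.Feature(float_list=tf.train.FloatList(value=[float(row["fare"])])),
        "sex": tf.train.Feature(bytes_list=tf.train.BytesList(value=[row["sex"].encode("utf-8")])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

path = os.path.join(tempfile.mkdtemp(), "rows.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    for row in df.to_dict("records"):
        writer.write(row_to_example(row))  # one record per sample

# Count the records that were written.
num_records = sum(1 for _ in tf.data.TFRecordDataset(path))
print(num_records)  # 2
```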

4.1.2 Image data --> TFRecords

import tensorflow as tf
import pandas as pd

def image_example(image_string, id, label):
  '''
  Define the features stored for each image:
  id, height, width, depth (channels), label, image
  '''
  image_shape = tf.image.decode_png(image_string).shape
  feature = {
      'id': tf.train.Feature(int64_list=tf.train.Int64List(value=[id])),
      'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[image_shape[0]])),
      'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[image_shape[1]])),
      'depth': tf.train.Feature(int64_list=tf.train.Int64List(value=[image_shape[2]])),
      'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label])),
      'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_string])),
  }
  return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Get the ids and labels from the CSV file
ids_labels_data = pd.read_csv('trainLabels.csv')
ids = ids_labels_data['id'].values
labels_data = list(map(lambda x:bytes(x,encoding='UTF8'),ids_labels_data['label'].values))
num_image = len(ids)

with tf.io.TFRecordWriter('cifar10_train_image.tfrecords') as writer:
    for i in range(num_image):
        filenames = './train/' + str(i+1) + '.png'
        with open(filenames,'rb') as f:
            image_string = f.read()
        # Wrap each image in its own Example and write them one at a time
        image_tf_example = image_example(image_string, ids[i], labels_data[i])
        writer.write(image_tf_example)

4.2 Reading TFRecords files

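TFRecords files are read into a tf.data.Dataset, which is then used for training.
Reading takes three steps:
1. Read the file (TFRecordDataset);
2. Create the dictionary describing the feature types and build the parse function (_parse_image_function);
3. Use map to parse each Example, forming the dataset.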

import tensorflow as tf

raw_image_dataset = tf.data.TFRecordDataset('cifar10_train_image.tfrecords')


# Create a dictionary describing the features.
image_feature_description = {
    'id': tf.io.FixedLenFeature([], tf.int64),
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.string),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def _parse_image_function(example_proto):
  # Parse the input tf.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

parsed_image_dataset = raw_image_dataset.map(_parse_image_function)

for image_features in parsed_image_dataset.take(3):
  # At this point image_raw is still the encoded PNG byte string
  image_raw = image_features['image_raw']
  # Decode the bytes of each image into a uint8 array
  img = tf.io.decode_png(image_raw, channels=0, dtype=tf.dtypes.uint8)

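As this shows, image data written to a TFRecords file has to be parsed and decoded before it can form a dataset for training. The next section uses ImageDataGenerator to read images directly in an iterator style.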
