TensorFlow 2.x Data I/O and Preprocessing

Author: Arrow

Tags: tensorflow

Article link: https://blog.csdn.net/MyArrow/article/details/108710725

1. Introduction

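Data loading approaches supported by TensorFlow:
- Feeding data to TensorFlow from Python code
- Dataset
- TFRecord
Common formats for small datasets
- CSV
- npy, npz: NumPy's native save formats
- pkl: Python's serialization (pickle) format
- HDF: a family of formats, of which HDF5 is the latest
Common format for large datasets
- TFRecord: stored as binary, well suited to reading large volumes of data sequentially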
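
For the small-data formats listed above, a common pattern is to load the file into memory with NumPy and pass the arrays to `from_tensor_slices`. The following is a minimal sketch under stated assumptions: the file name `toy_data.npz` and the array keys `x` and `y` are illustrative and do not come from the original article.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical toy data; the file name and array keys are assumptions.
x = np.arange(12, dtype=np.float32).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1], dtype=np.int64)
npz_path = os.path.join(tempfile.mkdtemp(), "toy_data.npz")
np.savez(npz_path, x=x, y=y)

# Load the arrays back and slice them along axis 0 into (feature, label) pairs.
with np.load(npz_path) as data:
    npz_dataset = tf.data.Dataset.from_tensor_slices((data["x"], data["y"]))

npz_elements = list(npz_dataset.as_numpy_iterator())
print(len(npz_elements))
```

This only suits data that fits in memory, since the whole array is converted to a tensor up front.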

2. Dataset

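The core of tf.data is the tf.data.Dataset class, which provides a high-level abstraction over a dataset.
- A tf.data.Dataset consists of a sequence of iterable elements, each containing one or more tensors. For a dataset of images, for example, each element can be a single image tensor of shape height × width × channels, or a tuple made up of an image tensor and its label tensor.
Dataset: represents a sequence of elements; each element contains one or more Tensor objects and corresponds to one sample.
Common ways to create a Dataset:
- From source data, e.g. Dataset.from_tensor_slices()
  - The data is taken from the input tensors, which can be, for example, a list of image file paths
- By transforming an existing Dataset, e.g. Dataset.map()/batch()
- By reading image files, with tf.keras.preprocessing.image_dataset_from_directory
The tf.data.Dataset API is used to build the pipeline that feeds data to a model; it composes simple, reusable pieces into an efficient, complex input pipeline.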
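
To make the notion of an element concrete, the sketch below builds a dataset of (image, label) tuples and inspects its `element_spec`. The array sizes and label values are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Four fake 8x8 RGB "images" and their integer labels (illustrative values).
images = np.zeros((4, 8, 8, 3), dtype=np.float32)
labels = np.array([0, 1, 1, 0], dtype=np.int64)

image_label_ds = tf.data.Dataset.from_tensor_slices((images, labels))

# element_spec describes the structure, shape, and dtype of one element.
image_spec, label_spec = image_label_ds.element_spec
print(image_spec.shape, image_spec.dtype)
print(label_spec.shape, label_spec.dtype)
```

Each element here is a tuple of one image tensor and one scalar label tensor, matching the description above.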

2.0 APIs for Creating a Dataset

API	Data source	Definition
Dataset.from_tensors	Python list	`from_tensors(tensors)` The resulting Dataset has a single element: the input tensors as a whole
Dataset.from_tensor_slices	Python list	`from_tensor_slices(tensors)` Its elements are slices of the input tensors along the first dimension
tf.data.TextLineDataset	Text files	`tf.data.TextLineDataset(filenames, compression_type=None, buffer_size=None, num_parallel_reads=None)` Contains the lines of one or more text files, one string tensor per line
tf.data.TFRecordDataset	Binary files (protobuf-based)	`tf.data.TFRecordDataset(filenames, compression_type=None, buffer_size=None, num_parallel_reads=None)` Contains the records of one or more TFRecord files
tf.keras.preprocessing.image_dataset_from_directory	Image files in a directory (jpeg, png, bmp, gif)	`image_dataset_from_directory(directory, labels='inferred', label_mode='int', class_names=None, color_mode='rgb', batch_size=32, image_size=(256, 256), shuffle=True, seed=None, validation_split=None, subset=None, interpolation='bilinear', follow_links=False)` Builds a Dataset from the image files in a directory; the class labels are taken from the subdirectory names

2.0.1 Dataset.from_tensors

Example

import tensorflow as tf

dataset = tf.data.Dataset.from_tensors([1, 2, 3, 4])
print(type(dataset))
for element in dataset:
    print(element)
    
a = list(dataset.as_numpy_iterator())
print(a)

dataset = tf.data.Dataset.from_tensors(([1, 2, 3, 4], 'A'))
for element in dataset:
    print(element)

b = list(dataset.as_numpy_iterator())
print(b)

Output

<class 'tensorflow.python.data.ops.dataset_ops.TensorDataset'>
tf.Tensor([1 2 3 4], shape=(4,), dtype=int32)
[array([1, 2, 3, 4])]
(<tf.Tensor: shape=(4,), dtype=int32, numpy=array([1, 2, 3, 4])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)
[(array([1, 2, 3, 4]), b'A')]

2.0.2 Dataset.from_tensor_slices

Example 1

# Slicing a 1D tensor produces scalar tensor elements.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
for element in dataset:
    print(element)
print(list(dataset.as_numpy_iterator()))
print("")

# Slicing a 2D tensor produces 1D tensor elements.
dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4],[5, 6]])
for element in dataset:
    print(element)
print(list(dataset.as_numpy_iterator()))
print("")

# Slicing a tuple of 1D tensors produces tuple elements containing
# scalar tensors.
dataset = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4], [5, 6]))
for element in dataset:
    print(element)
print(list(dataset.as_numpy_iterator()))
print("")

# Dictionary structure is also preserved.
dataset = tf.data.Dataset.from_tensor_slices({"a": [1, 2], "b": [3, 4]})
for element in dataset:
    print(element)
print(list(dataset.as_numpy_iterator()) == [{'a': 1, 'b': 3},
                                      {'a': 2, 'b': 4}])
print("")

Output 1

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
[1, 2, 3, 4]

tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor([5 6], shape=(2,), dtype=int32)
[array([1, 2]), array([3, 4]), array([5, 6])]

(<tf.Tensor: shape=(), dtype=int32, numpy=1>, <tf.Tensor: shape=(), dtype=int32, numpy=3>, <tf.Tensor: shape=(), dtype=int32, numpy=5>)
(<tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=int32, numpy=4>, <tf.Tensor: shape=(), dtype=int32, numpy=6>)
[(1, 3, 5), (2, 4, 6)]

{'a': <tf.Tensor: shape=(), dtype=int32, numpy=1>, 'b': <tf.Tensor: shape=(), dtype=int32, numpy=3>}
{'a': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'b': <tf.Tensor: shape=(), dtype=int32, numpy=4>}
True

Example 2

# Two tensors can be combined into one Dataset object.
features = tf.constant([[1, 3], [2, 1], [3, 3]]) # ==> 3x2 tensor
labels = tf.constant(['A', 'B', 'A']) # ==> 1-D tensor of shape (3,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
for element in dataset:
    print(element)
print(list(dataset.as_numpy_iterator()))
print("")

# Both the features and the labels tensors can be converted
# to a Dataset object separately and combined after.
features_dataset = tf.data.Dataset.from_tensor_slices(features)
labels_dataset = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))
for element in dataset:
    print(element)
print(list(dataset.as_numpy_iterator()))
print("")

Output 2

(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 3])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)
(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([2, 1])>, <tf.Tensor: shape=(), dtype=string, numpy=b'B'>)
(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 3])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)
[(array([1, 3]), b'A'), (array([2, 1]), b'B'), (array([3, 3]), b'A')]

(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 3])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)
(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([2, 1])>, <tf.Tensor: shape=(), dtype=string, numpy=b'B'>)
(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 3])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)
[(array([1, 3]), b'A'), (array([2, 1]), b'B'), (array([3, 3]), b'A')]

2.0.3 tf.data.TextLineDataset

Example

dataset = tf.data.TextLineDataset(["data\\file1.txt", "data\\file2.txt"]) 
for element in dataset:
    print(element)

Output

tf.Tensor(b'File1 : 1111', shape=(), dtype=string)
tf.Tensor(b'File1 : 1112', shape=(), dtype=string)
tf.Tensor(b'File1 : 1113', shape=(), dtype=string)
tf.Tensor(b'File1 : 1114', shape=(), dtype=string)
tf.Tensor(b'File2: 221', shape=(), dtype=string)
tf.Tensor(b'File2: 222', shape=(), dtype=string)
tf.Tensor(b'File2: 223', shape=(), dtype=string)
tf.Tensor(b'File2: 224', shape=(), dtype=string)
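
Because each line arrives as a raw string tensor, a CSV file is typically read by combining `TextLineDataset` with `tf.io.decode_csv`. The sketch below is one possible way to do this; the file name `toy.csv`, its columns, and the helper `parse_csv_line` are hypothetical and not part of the original article.

```python
import os
import tempfile

import tensorflow as tf

# Write a small CSV file (hypothetical name and columns) so the sketch is runnable.
csv_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(csv_path, "w") as csv_file:
    csv_file.write("height,weight,label\n1.5,40.0,0\n1.75,75.5,1\n1.625,52.0,0\n")

def parse_csv_line(line):
    # record_defaults fixes the dtype of each column: float32, float32, int32.
    height, weight, label = tf.io.decode_csv(
        line, record_defaults=[[0.0], [0.0], [0]])
    return tf.stack([height, weight]), label

csv_dataset = (
    tf.data.TextLineDataset(csv_path)
    .skip(1)              # drop the header row
    .map(parse_csv_line)  # string line -> (features, label)
)

csv_rows = [(feats.tolist(), int(lbl)) for feats, lbl in csv_dataset.as_numpy_iterator()]
print(csv_rows)
```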

2.1 Why the Dataset API Is Needed

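Simplicity: data is managed and processed in one place, which keeps the code concise
- Conventional approach: batch, shuffle, padding, and similar operations are done on NumPy data in Python code, and the result is then converted to tensors. As a result, Python control code has to be interleaved with the TensorFlow code during training
- Dataset API: the data is processed directly inside the graph, and these operations are applied to the dataset as a whole, which makes the code more concise.
Integration
- Data managed through the Dataset API plugs directly into model training and evaluation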

2.2 Dataset Workflow

Workflow
- Create a dataset from the input data
- Apply some processing (map, batch)
- Iterate over the dataset
Creating a Dataset

import tensorflow as tf
from tensorflow import keras
dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4])
for element in dataset:
    print(element)
dataset1 = tf.data.TextLineDataset(["data\\file1.txt", "data\\file2.txt"]) 
for element in dataset1:
    print(element)

Output

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(b'1111', shape=(), dtype=string)
tf.Tensor(b'222', shape=(), dtype=string)
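
The three workflow steps (create, transform, iterate) can be sketched end to end as follows. The values and the doubling function are arbitrary choices for illustration.

```python
import tensorflow as tf

# 1) Create a dataset from in-memory values.
workflow_ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])

# 2) Transform it: double every element, then group into batches of 3.
workflow_ds = workflow_ds.map(lambda x: x * 2).batch(3)

# 3) Iterate over the result.
workflow_batches = [batch.numpy().tolist() for batch in workflow_ds]
print(workflow_batches)  # [[2, 4, 6], [8, 10, 12]]
```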

2.3 Creating from In-Memory Data (tf.data.Dataset.from_tensor_slices)

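The most basic way to build a tf.data.Dataset is tf.data.Dataset.from_tensor_slices().
It suits small datasets that fit entirely in memory.
Specifically, if all elements of the dataset are stacked along axis 0 into one large tensor (for example, the MNIST training set is a tensor of shape [60000, 28, 28, 1], representing 60000 single-channel grayscale images of 28*28 pixels), from_tensor_slices slices that tensor along axis 0 so that each slice becomes one element of the dataset.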

import numpy as np
import tensorflow as tf

data = np.array([0.1, 0.4, 0.6, 0.2, 0.8, 0.8, 0.4, 0.9, 0.3, 0.2])
label = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
# The dataset can be built with tf.data.Dataset.from_tensor_slices.

dataset = tf.data.Dataset.from_tensor_slices((data, label))
for x, y in dataset:
    print(x, y)

Output

tf.Tensor(0.1, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.4, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.6, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.2, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.8, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.8, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.4, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.9, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.3, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.2, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)

2.4 Preprocessing Dataset Objects

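The tf.data.Dataset class provides a range of preprocessing methods. The most common ones are listed below:

Method	Description
Dataset.map(f)	Applies the function f to every element and returns a new dataset (often combined with tf.io for reading and decoding files and tf.image for image processing)
Dataset.shuffle(buffer_size)	Shuffles the dataset, sampling from a buffer of buffer_size elements
Dataset.batch(batch_size)	Splits the dataset into batches: every batch_size elements are stacked along axis 0 (as with tf.stack()) into a single element
Dataset.prefetch()	Prefetches a number of elements ahead of time (can improve parallelism in the training loop)
Dataset.repeat()	Repeats the elements of the dataset
Dataset.reduce()	Aggregation operation, the counterpart of map; folds the dataset into a single result
Dataset.take()	Keeps only the first several elements of the dataset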
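
The table lists `take`, `reduce`, and `prefetch`, which the subsections below do not demonstrate. A minimal sketch of the three, using an arbitrary range of integers as input:

```python
import numpy as np
import tensorflow as tf

numbers = tf.data.Dataset.range(1, 11)  # elements 1..10, dtype int64

# take(n) keeps only the first n elements.
first_three = [int(v) for v in numbers.take(3).as_numpy_iterator()]

# reduce(initial_state, fn) folds the whole dataset into a single value.
total = numbers.reduce(np.int64(0), lambda state, value: state + value)

# prefetch prepares upcoming elements while the current one is being consumed;
# it affects timing only, not the elements themselves.
prefetched = [int(v) for v in numbers.prefetch(2).as_numpy_iterator()]

print(first_three, int(total), prefetched)
```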

2.4.1 map

Definition

map(
    map_func, num_parallel_calls=None, deterministic=None
)

Behavior
- Applies map_func to every element of the dataset; the order of the elements is preserved
Example

dataset = tf.data.Dataset.range(1,6)
dataset1 = dataset.map(lambda x: x**2)

for element in dataset1:
    print(element.numpy())

Output

1
4
9
16
25

2.4.2 repeat

import numpy as np
import tensorflow as tf

X = np.array([1, 2, 3, 4,  5,  6,  7,  8,  9,  10 ])
Y = np.array([1, 4, 9, 16, 25, 36, 49, 64, 81, 100])

# Create a more complex Dataset from in-memory data
dataset = tf.data.Dataset.from_tensor_slices((X, Y))
dataset1 = dataset.repeat()  # no count given: repeats indefinitely
it = iter(dataset1)
for i in range(20):
    x, y = next(it)
    print(x.numpy(), y.numpy())

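Output

1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100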

2.4.3 shuffle

dataset2 = dataset.shuffle(buffer_size=10)
it = iter(dataset2)
for i in range(10):
    x, y = next(it)
    print(x.numpy(), y.numpy())

Output (the order is random and differs between runs; each x stays paired with its y)
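
How well `shuffle` mixes the data depends on `buffer_size`. The sketch below contrasts a buffer that covers the whole dataset with a buffer of size 1; the seed value is an arbitrary choice, used with the intent of making the order reproducible.

```python
import tensorflow as tf

ordered = tf.data.Dataset.range(10)

# A buffer covering the whole dataset allows a full shuffle.
full_shuffle = [int(v) for v in ordered.shuffle(buffer_size=10, seed=42).as_numpy_iterator()]

# A buffer of size 1 always emits the only buffered element, so the order is unchanged.
no_shuffle = [int(v) for v in ordered.shuffle(buffer_size=1).as_numpy_iterator()]

print(full_shuffle)
print(no_shuffle)
```

For a thorough shuffle, a buffer at least as large as the dataset is the usual choice; smaller buffers only mix elements that are close together in the original order.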

2.4.4 batch

dataset_batch = dataset.batch(batch_size=5)
it = iter(dataset_batch)
for i in range(2):
    x, y = next(it)
    print(x.numpy(), y.numpy())

Output

[1 2 3 4 5] [ 1  4  9 16 25]
[ 6  7  8  9 10] [ 36  49  64  81 100]
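
When the dataset size is not a multiple of the batch size, the last batch is smaller. A short sketch of this, and of the `drop_remainder` argument that discards the incomplete batch (the sizes are chosen only for illustration):

```python
import tensorflow as tf

ten = tf.data.Dataset.range(10)

# By default the last batch may be smaller than batch_size.
batch_sizes = [len(b) for b in ten.batch(3).as_numpy_iterator()]

# drop_remainder=True discards the incomplete final batch, giving a fixed batch shape.
fixed_sizes = [len(b) for b in ten.batch(3, drop_remainder=True).as_numpy_iterator()]

print(batch_sizes)  # [3, 3, 3, 1]
print(fixed_sizes)  # [3, 3, 3]
```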

2.4.5 A Larger Example Dataset

import numpy as np
import tensorflow as tf

(train_data, train_label), (test_data, test_label) = tf.keras.datasets.mnist.load_data()
print(train_data.shape)  # [60000, 28, 28]
# print(train_data[0])
train_data = np.expand_dims(train_data.astype(np.float32) / 255.0, axis=-1)      # [60000, 28, 28, 1]
print("train_data:")
# print(train_data[0])
print(train_data.shape)
print(train_label.shape)

print("test_data:")
print(test_data.shape)
print(test_label.shape)

mnist_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_label))
print(mnist_dataset)

mnist_dataset=mnist_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
print(mnist_dataset)

Output

(60000, 28, 28)
train_data:
(60000, 28, 28, 1)
(60000,)
test_data:
(10000, 28, 28)
(10000,)
<TensorSliceDataset shapes: ((28, 28, 1), ()), types: (tf.float32, tf.uint8)>
<PrefetchDataset shapes: ((28, 28, 1), ()), types: (tf.float32, tf.uint8)>
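
In training code, `prefetch` is usually the last step after `shuffle` and `batch`. The sketch below shows that ordering on synthetic data with the same per-sample shape as MNIST, so it runs without a download; the random arrays are stand-ins, not real digits. It assumes TensorFlow 2.4 or later, where `tf.data.AUTOTUNE` is the newer name for `tf.data.experimental.AUTOTUNE`.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST (random values, not real digits).
fake_images = np.random.rand(100, 28, 28, 1).astype(np.float32)
fake_labels = np.random.randint(0, 10, size=(100,)).astype(np.uint8)

train_pipeline = (
    tf.data.Dataset.from_tensor_slices((fake_images, fake_labels))
    .shuffle(buffer_size=100)    # buffer covers the whole toy dataset
    .batch(32)                   # 100 samples -> batches of 32, 32, 32, 4
    .prefetch(tf.data.AUTOTUNE)  # let TensorFlow pick the prefetch depth
)

pipeline_shapes = [(tuple(img.shape), tuple(lbl.shape)) for img, lbl in train_pipeline]
print(pipeline_shapes)
```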

2.5 Creating from Image Files (image_dataset_from_directory)

Function: tf.keras.preprocessing.image_dataset_from_directory
Purpose: generates a tf.data.Dataset from the image files in a directory.
Function signature

tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    labels='inferred',
    label_mode='int',
    class_names=None,
    color_mode='rgb',
    batch_size=32,
    image_size=(256, 256),
    shuffle=True,
    seed=None,
    validation_split=None,
    subset=None,
    interpolation='bilinear',
    follow_links=False,
)

Description

# Directory structure of the image files
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg

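Calling image_dataset_from_directory(main_directory, labels='inferred') returns a tf.data.Dataset that yields batches of images from the subdirectories "class_a" and "class_b", together with the labels 0 and 1 (0 corresponds to "class_a" and 1 to "class_b").
Supported image formats: jpeg, png, bmp, and gif (for animated gifs, only the first frame is used)
Example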

# training Dataset
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
# validation Dataset
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
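
The example above leaves `data_dir`, `img_height`, `img_width`, and `batch_size` undefined. The sketch below fills them in with hypothetical values and builds a throwaway directory of tiny random PNG images so that the call can be run as is. The directory layout, file names, and sizes are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Build a temporary directory tree with 5 tiny PNG images per class (hypothetical names).
data_dir = tempfile.mkdtemp()
for class_name in ("class_a", "class_b"):
    os.makedirs(os.path.join(data_dir, class_name))
    for i in range(5):
        pixels = np.random.randint(0, 256, size=(16, 16, 3), dtype=np.uint8)
        png_bytes = tf.io.encode_png(pixels)
        tf.io.write_file(os.path.join(data_dir, class_name, f"img_{i}.png"), png_bytes)

img_height, img_width, batch_size = 32, 32, 4

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,   # 10 files -> 8 for training, 2 for validation
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

# class_names lists the subdirectory names in label order.
print(train_ds.class_names)
for image_batch, label_batch in train_ds.take(1):
    print(image_batch.shape, label_batch.shape)
```

Using the same `seed` and `validation_split` in the training and validation calls matters: it keeps the two subsets from overlapping.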

3. tfrecord

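In essence: binary storage, far fewer files, and fast sharded reads and writes
- TFRecord is a binary file format that can in principle hold any kind of information, improving storage and read/write efficiency
- Internally it uses the Protocol Buffers binary encoding scheme
- A TFRecord file is written one sample at a time, each sample as a dictionary of features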

3.1 Why TFRecord Is Needed

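Normally a training project ends up with train, test, or val folders that each hold thousands of images or text files. Stored as scattered individual files, they take up disk space and are slow and tedious to read one by one, and loading them can use a large amount of memory (some large datasets cannot be loaded in one go). Storing the data in TFRecord format is a sensible way to address this.
When a dataset is small, all of it can be loaded into memory for fast access. Once the data exceeds the memory size, it has to stay on disk and be read a little at a time, and the speed of moving, reading, and processing the data becomes a concern. TFRecord is used to speed this up and to save space.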

3.2 TFRecord Characteristics

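TFRecord is a relatively efficient data reading approach recommended by TensorFlow
TFRecord is really a storage format. To use it, the raw data is first read, converted to TFRecord format, and stored on disk; at training time the data is decoded and read back from the corresponding TFRecord files
Advantages over reading raw data from disk:
- TensorFlow provides functions designed to work with TFRecord that speed up data processing
- TFRecord uses the Protocol Buffers binary encoding internally. The data occupies a single contiguous block, so only one binary file has to be loaded, which is simple and fast and especially friendly to large training sets. When the training data is large, it can also be split across multiple TFRecord files to improve processing efficiency
- Pipelined parallel reading: when TFRecord data is read, an input queue is created from the TFRecord files. The queue has a fixed capacity (configurable, subject to hardware limits); as data leaves the queue, more data from the TFRecord files is prefetched into it, and this happens independently of the network's computation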
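
In TensorFlow 2.x, sharding and pipelined reading are typically expressed through `tf.data` rather than an explicit queue. The sketch below writes a few small shard files and reads them back in parallel with prefetching. The shard names and record contents are hypothetical, and `tf.data.AUTOTUNE` assumes TensorFlow 2.4 or later.

```python
import os
import tempfile

import tensorflow as tf

# Write 3 small shard files with 4 records each (hypothetical names and contents).
shard_dir = tempfile.mkdtemp()
shard_paths = []
for shard_id in range(3):
    path = os.path.join(shard_dir, f"part-{shard_id}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for record_id in range(4):
            writer.write(f"shard{shard_id}-rec{record_id}".encode("utf-8"))
    shard_paths.append(path)

# Read the shards in parallel and prefetch so that loading overlaps with consumption.
sharded_ds = (
    tf.data.TFRecordDataset(shard_paths, num_parallel_reads=3)
    .prefetch(tf.data.AUTOTUNE)
)

# Parallel reads interleave the shards, so sort before inspecting.
records = sorted(r.decode("utf-8") for r in sharded_ds.as_numpy_iterator())
print(len(records), records[:2])
```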

3.2 Writing a TFRecord File

TFRecord supports writing the following three data types. Each is wrapped as a list, via tf.train.BytesList, tf.train.Int64List, and tf.train.FloatList respectively, and stored in a tf.train.Feature
- string
- int64
- float32

import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt
import sys
import cv2
from random import shuffle
import glob

shuffle_data = True # shuffle the addresses before saving
cat_dog_train_path = 'data/CatvsDog/train/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path)
print('addrs:',addrs)
labels = [0 if 'cat' in addr else 1 for addr in addrs] # 0 = Cat, 1 = Dog
print(labels)
# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

print("len=", len(c))    
print('c=',c)
print('addrs=', addrs)
print('labels=', labels)

# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6*len(addrs))]
train_labels = labels[0:int(0.6*len(labels))]
val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]

test_addrs = addrs[int(0.8*len(addrs)):]
test_labels = labels[int(0.8*len(labels)):]
print(len(train_addrs))

def load_image(addr):
    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32)
    return img

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

train_filename = 'train.tfrecords' # address to save the TFRecords file

# Create and open the TFRecords file
writer = tf.io.TFRecordWriter(train_filename)
for i in range(len(train_addrs)):
    # Report write progress
    #if not i % 1000:
    print('Train data: {}/{}'.format(i, len(train_addrs)))
    sys.stdout.flush()
    # Load the image
    img = load_image(train_addrs[i])

    label = train_labels[i]

    # Build the feature dict with two key-value pairs
    # (tobytes() replaces the deprecated ndarray.tostring())
    feature = {'train/label': _int64_feature(label),
               'train/image': _bytes_feature(tf.compat.as_bytes(img.tobytes()))}

    # print("feature=", feature)

    # Create an Example protocol buffer
    example = tf.train.Example(features=tf.train.Features(feature=feature))

    # Serialize the Example and write it to the file
    writer.write(example.SerializeToString())
 
writer.close()
sys.stdout.flush()

3.3 Reading a TFRecord File
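
A minimal sketch of reading, assuming records written with the feature keys `train/label` and `train/image` and with float32 images of shape (224, 224, 3), as in the writing code above. To stay self-contained it first writes two fake records to a temporary file; the helper `parse_example` and the constant-valued images are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Write two fake 224x224x3 float32 "images" using the same feature keys as the writer.
tfrecord_path = os.path.join(tempfile.mkdtemp(), "train.tfrecords")
with tf.io.TFRecordWriter(tfrecord_path) as writer:
    for label in (0, 1):
        img = np.full((224, 224, 3), float(label), dtype=np.float32)
        feature = {
            'train/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'train/image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

# The feature description must match the keys and types used when writing.
feature_description = {
    'train/label': tf.io.FixedLenFeature([], tf.int64),
    'train/image': tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_description)
    # Raw bytes carry no shape or dtype, so both are restored by hand.
    image = tf.io.decode_raw(parsed['train/image'], tf.float32)
    image = tf.reshape(image, (224, 224, 3))
    return image, parsed['train/label']

read_ds = tf.data.TFRecordDataset(tfrecord_path).map(parse_example)
decoded = [(image.shape, int(label), float(image[0, 0, 0]))
           for image, label in read_ds.as_numpy_iterator()]
print(decoded)
```

The dtype and shape passed to `decode_raw` and `reshape` have to match what the writer stored, because the serialized bytes do not record them.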

4. Differences Between Dataset and TFRecord

tfrecord
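- Advantage: the data is saved to TFRecord files ahead of time, which cuts the cost of opening files on every read and helps when training on large datasets
- Disadvantage: the approach is rigid; whenever there is a new dataset, new TFRecord files have to be generated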
tf.data.Dataset
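- Advantage: more flexible; it works as a pipeline in which the CPU prepares data while the GPU trains, and no extra files need to be generated in advance
Use
- Both approaches can handle training on large datasets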
