Creating and Reading Datasets in Python with TensorFlow

zx520113

Article link: https://blog.csdn.net/zx520113/article/details/84556489

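Before running a machine-learning experiment, you need to prepare the image data used for training and testing. How can image data be packed into a file and read back?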

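To pack image data, TensorFlow provides a commonly used function, tf.python_io.TFRecordWriter(save_file). It writes serialized tf.train.Example records to a .tfrecords file (here train.tfrecords), so the images are stored in binary form.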

import os
import tensorflow as tf
from PIL import Image

def Save_data(filename='..\\imgout',save_file="train.tfrecords",img_size=(224,224)):
    """
    Store the images under a folder in a .tfrecords file using TensorFlow's tf.train.Example.
    Each subfolder of filename must be named with its integer class label.
    :param filename: path of the image folder
    :param save_file: path (file name) of the output .tfrecords file
    :param img_size: (width, height) that every image is resized to
    :return: None
    """
    writer = tf.python_io.TFRecordWriter(save_file)
    for index in os.listdir(filename):
        print("Label",index)
        class_path = os.path.join(filename, index)
        for img_name in os.listdir(class_path):
            img_path = os.path.join(class_path, img_name)
            img = Image.open(img_path).convert("RGB")  # force 3 channels so the reader can reshape to [224, 224, 3]
            img = img.resize(img_size)
            img = img.tobytes()  # convert the image to raw bytes
            example = tf.train.Example(features=tf.train.Features(feature={
                    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(index)])),
                    'img': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img]))}))
            writer.write(example.SerializeToString())
    writer.close()
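
Save_data assumes that each subfolder of the image directory is named with an integer class label, because the folder name is passed to int(). The following sketch uses only the standard library and lists the (path, label) pairs such a layout would produce. The helper name list_labeled_images and the decision to skip folders whose names are not digits are assumptions made for illustration; they are not part of the original code.

```python
import os

def list_labeled_images(root):
    """Return sorted (image_path, label) pairs for a root/<digit label>/<image> layout."""
    pairs = []
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        # Skip stray files and folders whose names are not made of digits;
        # Save_data would fail on them when it calls int(name).
        if not os.path.isdir(class_dir) or not name.isdigit():
            continue
        for img_name in sorted(os.listdir(class_dir)):
            pairs.append((os.path.join(class_dir, img_name), int(name)))
    return pairs
```

Running a check like this before writing the .tfrecords file makes it easier to spot a misnamed folder than a ValueError raised halfway through the export.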

If you read the .tfrecords file directly and print the records, what comes out is the raw binary data.

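def read_raw(filename="train.tfrecords"):
    """Yield (image bytes, label) for every record, without decoding the image."""
    for serialized_example in tf.python_io.tf_record_iterator(filename):
        example = tf.train.Example()
        example.ParseFromString(serialized_example)
        img = example.features.feature['img'].bytes_list.value[0]  # raw image bytes
        label = example.features.feature['label'].int64_list.value[0]
        yield img, label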

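How should the data be read so that a neural network model built in TensorFlow can consume it? For datasets such as MNIST or CIFAR there are ready-made loading functions, but what about your own training and test images?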

def read_and_decode(filename):
    """Based on the blog post https://blog.csdn.net/u012759136/article/details/52232266"""
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example,features={
                                           'label': tf.FixedLenFeature([], tf.int64),
                                           'img' : tf.FixedLenFeature([], tf.string),})
    img = tf.decode_raw(features['img'], tf.uint8)
    img = tf.reshape(img, [224, 224, 3])
    img = tf.cast(img, tf.float32) * (1. / 255) - 0.5
    label = tf.cast(features['label'], tf.int32)
    return img, label

if __name__=="__main__":
    # Save_data(filename)  # save the data
    img, label = read_and_decode("train.tfrecords")  # read the data
    img_batch, label_batch = tf.train.shuffle_batch([img, label],batch_size=30, capacity=200,min_after_dequeue=100)
    init = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init)
        threads = tf.train.start_queue_runners(sess=sess)
        for i in range(3):
            val, l= sess.run([img_batch, label_batch])
            print(val.shape , l)

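This approach follows https://blog.csdn.net/u012759136/article/details/52232266, and it works. However, as TensorFlow has been updated, running it now prints red warnings saying that the calls are "deprecated and will be removed in a future version":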

WARNING:tensorflow:From G:/python/TensorFlowData/2.py:82: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From E:\python\python3\lib\site-packages\tensorflow\python\training\input.py:276: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From E:\python\python3\lib\site-packages\tensorflow\python\training\input.py:188: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
WARNING:tensorflow:From E:\python\python3\lib\site-packages\tensorflow\python\training\input.py:197: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From E:\python\python3\lib\site-packages\tensorflow\python\training\input.py:197: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From G:/python/TensorFlowData/2.py:83: TFRecordReader.__init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From G:/python/TensorFlowData/2.py:106: shuffle_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size)`.
WARNING:tensorflow:From E:\python\python3\lib\site-packages\tensorflow\python\util\tf_should_use.py:189: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
2018-11-26 21:12:09.077078: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From G:/python/TensorFlowData/2.py:110: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
(30, 224, 224, 3) [1 2 1 1 2 2 1 2 1 2 2 2 2 1 1 2 2 1 1 2 2 2 1 2 3 1 1 2 2 1]
(30, 224, 224, 3) [2 1 1 3 3 1 2 2 1 2 2 1 2 3 2 2 3 1 1 2 1 1 2 1 1 3 2 1 3 1]
2018-11-26 21:12:09.271089: W tensorflow/core/kernels/queue_base.cc:277] _0_input_producer: Skipping cancelled enqueue attempt with queue not closed
ERROR:tensorflow:Exception in QueueRunner: Enqueue operation was cancelled
	 [[{{node input_producer/input_producer_EnqueueMany}} = QueueEnqueueManyV2[Tcomponents=[DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](input_producer, input_producer/Const, ^input_producer/Assert/Assert)]]
(30, 224, 224, 3) [3 1 2 2 2 1 1 1 2 2 3 2 3 3 3 3 2 3 3 3 3 1 3 1 1 3 2 2 1 3]

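Following the hints in the warnings, I looked up the current replacement APIs and rewrote the code. Data came through, but printing the image and its shape raised an error: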

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 37632 values, but the requested shape has 150528
	 [[{{node Reshape}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32](DecodeRaw, Reshape/shape)]]
	 [[node IteratorGetNext (defined at G:/python/TensorFlowData/2.py:69)  = IteratorGetNext[output_shapes=[[?,203,203,3], [?]], output_types=[DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

I then read up on tf.data.Dataset in this blog post: https://blog.csdn.net/neu_chenguangq/article/details/79590537

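It turned out that the dtype argument of tf.decode_raw was wrong: I had used tf.float32, but the images were stored as raw 8-bit bytes. Changing it to tf.uint8 fixed the error.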
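
The numbers in the error message are consistent with this explanation. As a quick check using only the standard library (the function name decoded_value_count is made up for this illustration): a 224×224 RGB image written by img.tobytes() occupies 150528 bytes, and decoding the same bytes as 4-byte floats yields a quarter as many values.

```python
import struct

def decoded_value_count(num_bytes, fmt):
    """Number of values obtained when num_bytes raw bytes are decoded with struct format fmt."""
    return num_bytes // struct.calcsize(fmt)

num_bytes = 224 * 224 * 3  # one byte per channel value, as written by img.tobytes()

as_uint8 = decoded_value_count(num_bytes, "B")     # unsigned 8-bit integer, 1 byte each
as_float32 = decoded_value_count(num_bytes, "<f")  # 32-bit float, 4 bytes each

print(as_uint8, as_float32)  # → 150528 37632
```

Only the uint8 count matches the 224 × 224 × 3 = 150528 elements that tf.reshape expects; 37632 is exactly the count reported in the error.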

def Read_data(filename="train.tfrecords",shape=[224,224,3]):
    """
    Read image data from a .tfrecords file.
    :param filename: file name
    :param shape: [height, width, channels] of the stored images
    :return: (image, label) tensors for the next batch
    """
    def parser(record):
        features = tf.parse_single_example(record, features={
            'label': tf.FixedLenFeature([], tf.int64),
            'img': tf.FixedLenFeature([], tf.string), })
        img = tf.decode_raw(features["img"], tf.uint8)  # must be tf.uint8 here; tf.float32 causes the reshape error
        img = tf.reshape(img, shape)
        # Normalize to the 0-1 range
        # img = tf.cast(img, tf.float32) * (1. / 255.) - 0.5
        img = tf.div(tf.to_float(img), 255.0)
        label = tf.cast(features["label"], tf.int64)
        return img,label

    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    dataset = dataset.batch(1)  # batch size: one image per step
    # buffer_size=1 leaves the order unchanged; a buffer_size equal to the number of samples shuffles the whole dataset.
    # Note: shuffle is usually placed before batch so that individual samples, not whole batches, are shuffled.
    dataset = dataset.shuffle(buffer_size=1)
    iterator = dataset.make_one_shot_iterator()
    imglabelout = iterator.get_next()
    return imglabelout

if __name__ == '__main__':
    datasetimg=Read_data()
    sess = tf.Session()
    for i in range(10):
        img, label = sess.run(datasetimg)
        print(img.shape, label)
        print()
    sess.close()
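
The effect of the shuffle buffer size can be illustrated without TensorFlow. The sketch below is a simplified pure-Python model of a shuffle buffer, not TensorFlow's actual implementation; the function name buffered_shuffle and its seed parameter are assumptions made for the example. It keeps up to buffer_size items in a buffer and, each time the buffer is full, emits one of them at random.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a shuffle buffer of the given size."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # The buffer is full: emit one randomly chosen element.
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        # The input is exhausted: drain what is left, still in random order.
        yield buffer.pop(rng.randrange(len(buffer)))

print(list(buffered_shuffle(range(10), buffer_size=1)))   # order is unchanged
print(list(buffered_shuffle(range(10), buffer_size=10)))  # any permutation is possible
```

With buffer_size=1 the only candidate is always the item just read, so nothing is shuffled. With a small buffer, an item can move forward by at most buffer_size - 1 positions, which is why a buffer as large as the dataset is needed for a full shuffle.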

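Images loaded as-is are rather uniform in appearance. The functions in tf.image can be used to augment them: translation, rotation, flipping, cropping, scaling, noise, and brightness and contrast changes.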

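Before augmenting, suppose there are only 10 images: how can they be multiplied into 100? You could of course copy and paste them inside the folder and rename the copies. In TensorFlow it can be done as follows; change the image paths to suit your own setup.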

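with tf.Session() as sess:
    filename = ['31.jpg', '32.jpg', '33.jpg']
    # Each file name is emitted num_epochs times, so 3 files produce 60 copies.
    filename_queue = tf.train.string_input_producer(filename, shuffle=False, num_epochs=20)
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    tf.local_variables_initializer().run()  # the epoch counter is a local variable
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    i = 0
    try:
        while True:
            i += 1
            img_data = sess.run(value)
            with open('img\\3\\test_%d.jpg' % i, 'wb') as f:
                f.write(img_data)
    except tf.errors.OutOfRangeError:
        pass  # raised once the queue is exhausted after num_epochs passes
    finally:
        coord.request_stop()
        coord.join(threads)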
Image augmentation:

def enhance_img(img, height, width):
    """Image augmentation"""
    img = tf.cast(img, tf.float32)  # cast the image to float32
    distorted_img = tf.random_crop(img, [height, width, 3])  # randomly crop a height x width patch; the callers in this post pass the original size divided by 1.1
    distorted_img = tf.image.random_flip_left_right(distorted_img)  # flip left-right with 50% probability, otherwise leave unchanged
    distorted_img = tf.image.random_flip_up_down(distorted_img)  # flip up-down with 50% probability, otherwise leave unchanged
    distorted_img = tf.image.random_brightness(distorted_img, max_delta=63)  # randomly change the brightness
    distorted_img = tf.image.random_contrast(distorted_img, lower=0.2, upper=1.8)  # randomly change the contrast
    distorted_img = tf.image.per_image_standardization(distorted_img)  # standardize the image
    return distorted_img
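
The last step, tf.image.per_image_standardization, rescales each image to zero mean and unit variance. Below is a rough pure-Python sketch of that computation on a flat list of pixel values. The function name standardize is an assumption. The lower bound on the standard deviation follows the formula given in the TensorFlow documentation and guards against division by zero on uniform images.

```python
import math

def standardize(pixels):
    """Return (x - mean) / adjusted_stddev for a flat list of pixel values."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # Lower-bound the standard deviation so a uniform image does not divide by zero.
    adjusted_stddev = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [(p - mean) / adjusted_stddev for p in pixels]

print(standardize([0.0, 255.0, 0.0, 255.0]))  # → [-1.0, 1.0, -1.0, 1.0]
```

Because the output is centered on zero, it no longer lies in the 0 to 255 range. That matters if the result is later divided by 255 or written to an image file.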

The replicated images are saved locally. You can walk the folder with os, then read each image with cv2 or PIL's Image and process it.

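def save_enhance_img(filein,fileout):
    """
    Augment every image under filein and write the results under fileout.
    :param filein: directory of the input images, one subfolder per class
    :param fileout: directory for the processed images; the same subfolders must already exist
    :return: None
    """
    with tf.Session() as sess:  # enhance_img builds TensorFlow ops, so a session is needed to evaluate them
        for classname in os.listdir(filein):  # one subfolder per class
            class_dir = os.path.join(filein, classname)
            for imgname in os.listdir(class_dir):
                img = cv2.imread(os.path.join(class_dir, imgname))
                h, w = img.shape[:2]  # OpenCV arrays are (height, width, channels)
                img = tf.convert_to_tensor(img)
                img = enhance_img(img, int(h / 1.1), int(w / 1.1))
                imgout = sess.run(img)
                # The result is standardized (zero mean), so rescale it to 0-255 before saving.
                lo, hi = imgout.min(), imgout.max()
                imgout = ((imgout - lo) / max(hi - lo, 1e-8) * 255).astype('uint8')
                cv2.imwrite(os.path.join(fileout, classname, imgname), imgout)
    print("OK!")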

The complete program for storing and reading data with TensorFlow in Python:

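#!/usr/bin/python3
# -*- coding: utf-8 -*-
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # set before importing TensorFlow to hide its C++ INFO and WARNING messages

import tensorflow as tf
from PIL import Image

filename='..\\imgout'  # image data folder; here it sits in a sibling directory under the same parent, so adjust the "..\\" prefix for a different layout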

def Save_data(filename='..\\imgout',save_file="train.tfrecords",img_size=(224,224)):
    """
    Store the images under a folder in a .tfrecords file using TensorFlow's tf.train.Example.
    Each subfolder of filename must be named with its integer class label.
    :param filename: path of the image folder
    :param save_file: path (file name) of the output .tfrecords file
    :param img_size: (width, height) that every image is resized to
    :return: None
    """
    writer = tf.python_io.TFRecordWriter(save_file)
    for index in os.listdir(filename):
        print("Label",index)
        class_path = os.path.join(filename, index)
        for img_name in os.listdir(class_path):
            img_path = os.path.join(class_path, img_name)
            img = Image.open(img_path).convert("RGB")  # force 3 channels so the reader can reshape to [224, 224, 3]
            img = img.resize(img_size)
            img = img.tobytes()  # convert the image to raw bytes
            example = tf.train.Example(features=tf.train.Features(feature={
                    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(index)])),
                    'img': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img]))}))
            writer.write(example.SerializeToString())
    writer.close()

def Read_data(filename="train.tfrecords",shape=[224,224,3]):
    """
    Read image data from a .tfrecords file and apply augmentation.
    :param filename: file name
    :param shape: [height, width, channels] of the stored images
    :return: (image, label) tensors for the next batch
    """
    def parser(record):
        features = tf.parse_single_example(record, features={
            'label': tf.FixedLenFeature([], tf.int64),
            'img': tf.FixedLenFeature([], tf.string), })
        img = tf.decode_raw(features["img"], tf.uint8)  # must be tf.uint8 here; tf.float32 causes the reshape error
        img = tf.reshape(img, shape)
        img = enhance_img(img,int(shape[0]/1.1),int(shape[1]/1.1))  # image augmentation; the result is already standardized
        # Without augmentation, normalize to the 0-1 range instead:
        # img = tf.div(tf.to_float(img), 255.0)
        label = tf.cast(features["label"], tf.int64)
        return img,label

    def enhance_img(img, height, width):
        """Image augmentation"""
        img = tf.cast(img, tf.float32)  # cast the image to float32
        distorted_img = tf.random_crop(img, [height, width, 3])  # randomly crop a patch of the original size divided by 1.1
        distorted_img = tf.image.random_flip_left_right(distorted_img)  # flip left-right with 50% probability, otherwise leave unchanged
        distorted_img = tf.image.random_flip_up_down(distorted_img)  # flip up-down with 50% probability, otherwise leave unchanged
        distorted_img = tf.image.random_brightness(distorted_img, max_delta=63)  # randomly change the brightness
        distorted_img = tf.image.random_contrast(distorted_img, lower=0.2, upper=1.8)  # randomly change the contrast
        distorted_img = tf.image.per_image_standardization(distorted_img)  # standardize the image
        return distorted_img

    # if choose==1:
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    dataset = dataset.batch(1)  # batch size: one image per step
    # buffer_size=1 leaves the order unchanged; a buffer_size equal to the number of samples shuffles the whole dataset.
    # Note: shuffle is usually placed before batch so that individual samples, not whole batches, are shuffled.
    dataset = dataset.shuffle(buffer_size=1)
    iterator = dataset.make_one_shot_iterator()
    imglabelout = iterator.get_next()
    return imglabelout
    # else:
    #     """Output the images in raw binary form"""
    #     for serialized_example in tf.python_io.tf_record_iterator(filename):
    #         example = tf.train.Example()
    #         example.ParseFromString(serialized_example)
    #         img = example.features.feature['img'].bytes_list.value
    #         label = example.features.feature['label'].int64_list.value
    #         return img,label

def read_and_decode(filename):
    """Based on the blog post https://blog.csdn.net/u012759136/article/details/52232266"""
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example,features={
                                           'label': tf.FixedLenFeature([], tf.int64),
                                           'img' : tf.FixedLenFeature([], tf.string),})
    img = tf.decode_raw(features['img'], tf.uint8)
    img = tf.reshape(img, [224, 224, 3])
    img = tf.cast(img, tf.float32) * (1. / 255) - 0.5
    label = tf.cast(features['label'], tf.int32)
    return img, label

if __name__ == '__main__':
    datasetimg=Read_data()
    sess = tf.Session()
    for i in range(10):
        img, label = sess.run(datasetimg)
        print(img.shape, label)
        print()
    sess.close()

# if __name__=="__main__":
#     # Save_data(filename)  # save the data
#     img, label = read_and_decode("train.tfrecords")  # read the data
#     img_batch, label_batch = tf.train.shuffle_batch([img, label],batch_size=30, capacity=200,min_after_dequeue=100)
#     init = tf.initialize_all_variables()
#     with tf.Session() as sess:
#         sess.run(init)
#         threads = tf.train.start_queue_runners(sess=sess)
#         for i in range(3):
#             val, l= sess.run([img_batch, label_batch])
#             print(val.shape , l)

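The message "Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2" is informational. It can be hidden by adding the following at the top of the script, before TensorFlow is imported: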

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

A follow-up post covers building datasets in CSV format: https://blog.csdn.net/zx520113/article/details/84557459
