Reading data in TensorFlow

越奋斗，越幸运

Article link: https://blog.csdn.net/fanjianhai/article/details/103064399

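1. Queues and threads

1.1. tf.FIFOQueue

FIFOQueue(capacity, dtypes, name='fifo_queue')
Creates a queue that dequeues elements in first-in, first-out order.
- capacity: integer. The upper bound on the number of elements that may be stored in the queue
- dtypes: a list of DType objects. Its length must equal the number of tensors in each queue element; the dtypes determine the type of every element enqueued later
- Methods
- dequeue(name=None)
- enqueue(vals, name=None)
- enqueue_many(vals, name=None): vals is a list or tuple; returns an enqueue operation
- size(name=None)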
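
The capacity bound and the first-in, first-out ordering can be illustrated without TensorFlow. The sketch below uses Python's standard `queue.Queue` as a stand-in; this is an assumption made for illustration, not how `tf.FIFOQueue` is implemented. `put_nowait`, `get_nowait` and `qsize` play the roles of enqueue, dequeue and size.

```python
import queue

# Stand-in for tf.FIFOQueue(capacity=3, ...): a bounded FIFO queue.
q = queue.Queue(maxsize=3)
for value in (0.1, 0.2, 0.3):
    q.put_nowait(value)          # analogous to enqueue

try:
    q.put_nowait(0.4)            # a fourth element exceeds the capacity
    overflowed = False
except queue.Full:
    overflowed = True

first = q.get_nowait()           # analogous to dequeue: oldest element first
remaining = q.qsize()            # analogous to size()
print(overflowed, first, remaining)
```

A TensorFlow enqueue on a full queue blocks instead of raising, so the non-blocking calls are used here only to make the limit visible.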

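1.2. Dequeue, add 1, enqueue (synchronous)

import tensorflow as tf

# Simulate a synchronous pipeline: the data is processed first, then read for training.
# In TensorFlow, running an op also runs the ops it depends on.

# 1. Define a queue.
queue = tf.FIFOQueue(capacity=3, dtypes=tf.float32)
# Put some data in. enqueue_many expects a list with one entry per queue
# component, so the values are wrapped in an outer list; a bare
# [0.1, 0.2, 0.3] is interpreted as a list of components, not as one tensor.
enqueue_many = queue.enqueue_many([[0.1, 0.2, 0.3]])
# 2. Define the processing step: dequeue, add 1, enqueue.
out_q = queue.dequeue()
data = out_q + 1  # overloaded operator
en_q = queue.enqueue(data)  # enqueue the result
# Build the size op once instead of creating new ops inside the loop.
size_op = queue.size()

with tf.Session() as sess:
    sess.run(enqueue_many)
    # Process the data. Running en_q also runs out_q and data, which it depends on.
    for i in range(100):
        sess.run(en_q)

    # Read the data for training.
    for i in range(sess.run(size_op)):
        print(sess.run(out_q))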
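
To see what the loop leaves in the queue, the same dequeue, add 1, enqueue cycle can be traced in plain Python. This is a minimal sketch that uses `collections.deque` as a stand-in for the TensorFlow queue; the helper name `rotate_and_increment` is hypothetical.

```python
from collections import deque


def rotate_and_increment(initial, steps):
    """Repeat 'dequeue the head, add 1, enqueue at the tail' for a number of steps."""
    q = deque(initial)
    for _ in range(steps):
        value = q.popleft()      # dequeue the oldest element
        q.append(value + 1)      # add 1 and put it back at the tail
    return list(q)


result = rotate_and_increment([0.1, 0.2, 0.3], steps=100)
print([round(v, 1) for v in result])
```

With three elements and 100 steps, the first element is incremented 34 times and the other two 33 times, so the values come out as 33.2, 33.3 and 34.1 after rounding.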

1.3. Notes

[Image not preserved in the source]

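2. Queue runner and thread coordinator example

2.1. Queue runner

A session can run multiple threads, which makes asynchronous reading possible.
tf.train.QueueRunner(queue, enqueue_ops=None)
Creates a QueueRunner.
- queue: a Queue
- enqueue_ops: the list of enqueue operations to run in threads, one thread per entry; [op] * 2 runs the same op in two threads
- create_threads(sess, coord=None, start=False)
Creates threads that run the enqueue operations in the given session.
- start: boolean. If True, the threads are started; if False, the caller must call start() on them
- coord: the thread coordinator, used later to manage the threads
- return: the list of created threads

2.2. Thread coordinator

tf.train.Coordinator()
- Thread coordinator: implements a simple mechanism to coordinate the termination of a set of threads
- return: a coordinator instance
- request_stop(): requests that the threads stop
- should_stop(): checks whether a stop has been requested
- join(threads=None, stop_grace_period_secs=120): waits for the threads to terminate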

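2.3. Code

import tensorflow as tf

# Simulate asynchronous operation: child threads enqueue samples while the main thread reads them.

# 1. Define a queue with capacity 1000.
queue = tf.FIFOQueue(1000, tf.float32)
# 2. Define the work to repeat: add 1 to a variable and enqueue the value.
var = tf.Variable(0.0)
data = tf.assign_add(var, tf.constant(1.0))
en_q = queue.enqueue(data)
# 3. Define the queue runner: how many child threads to use and what each one runs.
qr = tf.train.QueueRunner(queue, enqueue_ops=[en_q] * 2)
# Build the dequeue op once, outside the loop.
out_q = queue.dequeue()

# Op that initializes the variables.
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)

    # Create the thread coordinator.
    coord = tf.train.Coordinator()
    # Start the child threads; this must happen inside a session.
    threads = qr.create_threads(sess, coord=coord, start=True)

    # Main thread: keep reading data for training.
    for i in range(300):
        print(sess.run(out_q))

    # Stop and reclaim the child threads.
    coord.request_stop()
    coord.join(threads)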
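
The producer/consumer pattern above can also be sketched with the standard library alone. The class `SimpleCoordinator` and the function `enqueue_loop` below are hypothetical stand-ins written for this illustration; they mimic the request_stop / should_stop / join protocol but are not TensorFlow's implementation.

```python
import queue
import threading


class SimpleCoordinator:
    """Hypothetical, minimal stand-in for tf.train.Coordinator."""

    def __init__(self):
        self._stop_event = threading.Event()

    def request_stop(self):
        self._stop_event.set()

    def should_stop(self):
        return self._stop_event.is_set()

    def join(self, threads, timeout=5.0):
        for thread in threads:
            thread.join(timeout)


def enqueue_loop(q, coord, counter, lock):
    """Worker: increment a shared counter and enqueue it until asked to stop."""
    while not coord.should_stop():
        with lock:
            counter[0] += 1.0
            value = counter[0]
        # Retry with a short timeout so a full queue cannot block shutdown.
        while not coord.should_stop():
            try:
                q.put(value, timeout=0.05)
                break
            except queue.Full:
                pass


q = queue.Queue(maxsize=1000)
coord = SimpleCoordinator()
counter, lock = [0.0], threading.Lock()
threads = [
    threading.Thread(target=enqueue_loop, args=(q, coord, counter, lock))
    for _ in range(2)
]
for thread in threads:
    thread.start()

# Main thread: consume 300 samples while the workers keep producing.
consumed = [q.get() for _ in range(300)]

# Ask the workers to stop, then wait for them.
coord.request_stop()
coord.join(threads)
print(len(consumed), all(not t.is_alive() for t in threads))
```

Because two workers enqueue concurrently, the consumed values are unique but not guaranteed to arrive in strictly increasing order.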

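3. File reading workflow

3.1. Reading steps

Construct a file queue
- File queue API
- tf.train.string_input_producer(string_tensor, num_epochs=None, shuffle=True)
  - Feeds the given strings (for example, file names) into a pipeline queue
  - string_tensor: a 1-D tensor containing the file names
  - num_epochs: how many passes to make over the data; by default the data is cycled indefinitely
  - return: a queue holding the output strings
Read the queue contents
- File readers
- Choose the reader that matches the file format
  - class tf.TextLineReader
    - Reads text files such as comma-separated values (CSV); reads one line at a time by default
    - return: a reader instance
  - tf.FixedLengthRecordReader(record_bytes)
    - Reads binary files in which every record has a fixed number of bytes
    - record_bytes: integer, the number of bytes read per record
    - return: a reader instance
  - tf.TFRecordReader
    - Reads TFRecords files
  - All readers share one read method:
    - read(file_queue): reads one record from the queue
    - Returns a tuple of tensors (key: the file name, value: the record content, a line or a block of bytes)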
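Decode
- File content decoders
- Files are read as strings, so functions are needed to parse those strings into tensors
  - tf.decode_csv(records, record_defaults=None, field_delim=None, name=None)
    - Converts CSV records to tensors; used together with tf.TextLineReader
    - records: a string tensor in which each string is one record (line) of the CSV
    - field_delim: the field delimiter, "," by default
    - record_defaults: determines the type of each resulting tensor and supplies the default value used when a field is missing from the input string
  - tf.decode_raw(bytes, out_type, little_endian=None, name=None)
    - Converts bytes to a vector of numbers; the input is a string tensor. Used together with tf.FixedLengthRecordReader, with the binary data read as uint8
Start the threads
- tf.train.start_queue_runners(sess=None, coord=None)
  - Collects all queue runners in the graph and starts their threads
  - sess: the session to run in
  - coord: the thread coordinator
  - return: a list of all the threads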
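Batching
- Batching at the reading end of the pipeline
  - tf.train.batch(tensors, batch_size, num_threads=1, capacity=32, name=None)
    - Reads a batch of tensors of the given size (count)
    - tensors: may be a list of tensors
    - batch_size: the number of elements taken from the queue per batch
    - num_threads: the number of threads enqueuing
    - capacity: integer, the maximum number of elements in the queue
    - return: tensors
- tf.train.shuffle_batch(tensors, batch_size, capacity, min_after_dequeue, num_threads=1)
  - Reads a batch of the given size in shuffled order
  - min_after_dequeue: the minimum number of elements left in the queue after a dequeue, which keeps the elements well mixed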
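
The steps above (file queue, reader, decoder, batching) can be sketched as a chain of plain Python generators. Everything in this block is an illustrative assumption: the in-memory `FILES` dict, the function names, and the `name:line_number` key format are stand-ins, and the thread-starting step is omitted because generators pull data lazily in a single thread.

```python
import itertools

# Hypothetical in-memory "files" so the sketch runs without touching disk.
FILES = {
    "A.csv": ["alpha1,A1", "alpha2,A2"],
    "B.csv": ["beta1,B1", "beta2,B2"],
}


def filename_queue(names, num_epochs=None):
    """Step 1: yield file names, num_epochs times (None means forever)."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        yield from names


def read_lines(names):
    """Step 2: yield one (key, value) pair per line, like a line reader."""
    for name in names:
        for line_number, line in enumerate(FILES[name], start=1):
            yield "{}:{}".format(name, line_number), line


def decode(records, field_delim=","):
    """Step 3: turn each raw string into a tuple of fields."""
    for _key, value in records:
        yield tuple(value.split(field_delim))


def batch(examples, batch_size):
    """Step 4: group single examples into fixed-size batches."""
    while True:
        chunk = list(itertools.islice(examples, batch_size))
        if len(chunk) < batch_size:
            return  # drop an incomplete final batch
        yield chunk


names = filename_queue(["A.csv", "B.csv"], num_epochs=1)
batches = list(batch(decode(read_lines(names)), batch_size=2))
print(batches)
```

Each stage handles one sample at a time and only the last stage groups samples, which mirrors why batching is a separate step in the TensorFlow pipeline.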

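3.2. TensorFlow reading behavior: one sample per read by default

CSV files: one line per read
Binary files: one sample per read, with the sample size given in bytes
Image files: one image per read

3.3. File reading diagram

[Image not preserved in the source]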

4. Simple file reading

Data

A.csv

alpha1,A1
alpha2,A2
alpha3,A3

B.csv

beta1,B1
beta2,B2
beta3,B3

C.csv

c1,C1
c2,C2
c3,C3

CSV reading code

import tensorflow as tf
import os

# Find the files and put them in a list.
file_names = os.listdir('./csv_data')
file_list = [os.path.join('./csv_data/', f) for f in file_names if f.endswith(".csv")]

# Build the file queue.
file_queue = tf.train.string_input_producer(file_list)
# Build the CSV reader, which reads the queue one line at a time by default.
reader = tf.TextLineReader()

key, value = reader.read(file_queue)

# Decode each line.
# record_defaults sets the type and default value of every column, e.g. [["None"], [4.0]].
records = [["None"], ["None"]]
example, label = tf.decode_csv(value, record_defaults=records)

# Reading several records at once requires batching.
# batch_size does not depend on the queue capacity or the number of records;
# it only sets how many records each batch returns.
example_batch, label_batch = tf.train.batch([example, label], batch_size=9, num_threads=1, capacity=9)

# Open a session.
with tf.Session() as sess:
    # Create a thread coordinator.
    coord = tf.train.Coordinator()
    # Start the file-reading threads.
    threads = tf.train.start_queue_runners(sess, coord=coord)
    # Print the content read.
    print(sess.run([example_batch, label_batch]))
    # Stop and reclaim the child threads.
    coord.request_stop()
    coord.join(threads)
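
The role of record_defaults (one entry per column, fixing both the type and the fallback value) can be illustrated in plain Python. The function `decode_csv_line` below is a hypothetical, simplified stand-in: it assumes unquoted fields and handles only str, int and float columns, so it is not a replacement for tf.decode_csv.

```python
def decode_csv_line(line, record_defaults, field_delim=","):
    """Split one CSV line and cast each field to the type of its default."""
    fields = line.split(field_delim)
    if len(fields) != len(record_defaults):
        raise ValueError("expected {} fields, got {}".format(
            len(record_defaults), len(fields)))
    decoded = []
    for raw, default in zip(fields, record_defaults):
        target_type = type(default[0])   # the default fixes the column type
        if raw == "":
            decoded.append(default[0])   # missing field: fall back to the default
        else:
            decoded.append(target_type(raw))
    return decoded


both_strings = decode_csv_line("alpha1,A1", [["None"], ["None"]])
missing_float = decode_csv_line("beta1,", [["None"], [4.0]])
parsed_float = decode_csv_line("c1,2.5", [["None"], [4.0]])
print(both_strings, missing_float, parsed_float)
```

With the sample data above every column is a string, which is why the reading code uses [["None"], ["None"]]; a numeric label column would use a numeric default such as [4.0] instead.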

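5. Reading images

5.1. Image basics

Grayscale image, single channel: each pixel has one value, the gray level, in the range 0~255
Color image, three channels: each pixel is made up of three values, RGB
The three elements of a digitized image
How the three elements relate to tensors: a single image is handled as a 3-D tensor of shape [height, width, channels]

[Image not preserved in the source]

5.2. Basic image operations

Purpose
- Make the image data uniform
- Convert all images to a specified size
- Reduce the amount of image data to avoid extra overhead
Operation
- Reduce the image size

5.3. Basic image operation API

tf.image.resize_images(images, size): resizes images
- images: image data as a 4-D tensor of shape [batch, height, width, channels] or a 3-D tensor of shape [height, width, channels]
- size: a 1-D int32 tensor, new_height, new_width: the new image size
- Returns the images in 4-D or 3-D format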

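5.4. Image reading API

Image reader
tf.WholeFileReader
- A reader that outputs the entire contents of a file as the value
- return: a reader instance
- read(file_queue): outputs a file name (key) and the contents of that file (value)
Image decoders
tf.image.decode_jpeg(contents)
- Decodes a JPEG-encoded image to a uint8 tensor
- return: a uint8 tensor, 3-D with shape [height, width, channels]
tf.image.decode_png(contents)
- Decodes a PNG-encoded image to a uint8 or uint16 tensor
- return: a tensor, 3-D with shape [height, width, channels]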

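6. Image batching example

6.1. Code

import tensorflow as tf
import os

# Find the files and put them in a list.
file_names = os.listdir('./dog_data')
file_list = [os.path.join('./dog_data/', f) for f in file_names if f.endswith(".jpg")]

# Build the file queue.
file_queue = tf.train.string_input_producer(file_list)

# Build the reader that reads the image contents (one image per read by default).
reader = tf.WholeFileReader()
key, value = reader.read(file_queue)
# Decode the image; channels=3 forces RGB output so the fixed shape below is valid.
image = tf.image.decode_jpeg(value, channels=3)
# Resize every image to the same size.
image_resize = tf.image.resize_images(image, [135, 40])
# Note: the sample shape must be fixed, because batching requires all samples to have the same, fully defined shape.
image_resize.set_shape([135, 40, 3])
print(image_resize)

# Reading several images at once requires batching.
# batch_size does not depend on the queue capacity or the number of images;
# it only sets how many images each batch returns.
image_batch = tf.train.batch([image_resize], batch_size=20, num_threads=1, capacity=20)
print(image_batch)

# Open a session and run.
with tf.Session() as sess:
    # Create a thread coordinator.
    coord = tf.train.Coordinator()

    # Start the file-reading threads.
    threads = tf.train.start_queue_runners(sess, coord=coord)
    # Print the content read.
    print(sess.run([image_batch]))

    # Stop and reclaim the child threads.
    coord.request_stop()
    coord.join(threads)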

6.2. Image types for storage and computation

Storage: uint8 (saves space)
Matrix computation: float32 (higher precision)
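
The space trade-off between the two types can be checked with the standard `array` module. This is a minimal sketch: the 135 x 40 x 3 size is taken from the example above, while the all-zero pixel data and the division by 255 are assumptions made only to have something to convert.

```python
from array import array

HEIGHT, WIDTH, CHANNELS = 135, 40, 3
num_values = HEIGHT * WIDTH * CHANNELS

# Storage form: one unsigned byte per value, like uint8.
stored = array("B", bytes(num_values))
# Computation form: one 32-bit float per value, like float32.
computed = array("f", (v / 255.0 for v in stored))

stored_bytes = stored.itemsize * len(stored)
computed_bytes = computed.itemsize * len(computed)
print(stored_bytes, computed_bytes)
```

The float form takes four times the space of the byte form, which is why images are kept as uint8 on disk and converted only when they enter the computation.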

7. Reading binary files

CIFAR-10 link: http://www.cs.toronto.edu/~kriz/cifar.html
Code

import os
import tensorflow as tf

class CifarRead(object):
    """Read the binary files, write them to tfrecords, and read the tfrecords back.
    """
    def __init__(self, file_list):
        self.file_list = file_list

        # Properties of the images being read.
        self.height = 32
        self.width = 32
        self.channel = 3

        # Bytes per sample in the binary file: 1 label byte plus the image bytes.
        self.label_bytes = 1
        self.image_bytes = self.height * self.width * self.channel
        self.bytes = self.label_bytes + self.image_bytes

    def read_and_decode(self):
        # 1. Build the file queue.
        file_queue = tf.train.string_input_producer(self.file_list)
        # 2. Build the binary reader; each read returns one sample of self.bytes bytes.
        reader = tf.FixedLengthRecordReader(self.bytes)
        key, value = reader.read(file_queue)
        # 3. Decode the binary content.
        label_image = tf.decode_raw(value, tf.uint8)

        # 4. Split the label from the image: slice out the target value and the features.
        label = tf.cast(tf.slice(label_image, [0], [self.label_bytes]), tf.int32)

        image = tf.slice(label_image, [self.label_bytes], [self.image_bytes])

        # 5. Reshape the image. CIFAR-10 stores each image one channel plane at a time,
        # so go [3072] --> [3, 32, 32] and then transpose to [32, 32, 3].
        image_chw = tf.reshape(image, [self.channel, self.height, self.width])
        image_reshape = tf.transpose(image_chw, [1, 2, 0])

        # 6. Batch the data.
        image_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=10, num_threads=1, capacity=10)

        print(image_batch, label_batch)

        return image_batch, label_batch

    
if __name__ == "__main__":
    # Find the files and put them in a list.
    file_names = os.listdir('./cifar-10_data')
    file_list = [os.path.join('./cifar-10_data/', f) for f in file_names if f[0:4] == 'data']

    cf = CifarRead(file_list)

    image_batch, label_batch = cf.read_and_decode()

    # Open a session and run.
    with tf.Session() as sess:
        # Create a thread coordinator.
        coord = tf.train.Coordinator()

        # Start the file-reading threads.
        threads = tf.train.start_queue_runners(sess, coord=coord)
        # Print the content read.
        print(sess.run([image_batch, label_batch]))

        # Stop and reclaim the child threads.
        coord.request_stop()
        coord.join(threads)
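
The record layout that the slicing and reshaping rely on can be checked in plain Python. In the CIFAR-10 binary version, each record is 1 label byte followed by 3072 pixel bytes stored as a red plane, a green plane and a blue plane, each in row-major order. The sketch below parses one such record; `parse_record` is a hypothetical helper and the record is synthetic, not real CIFAR data.

```python
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
LABEL_BYTES = 1
IMAGE_BYTES = HEIGHT * WIDTH * CHANNELS
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES


def parse_record(record):
    """Split one record into (label, image), with image as [height][width][channel]."""
    if len(record) != RECORD_BYTES:
        raise ValueError("expected {} bytes, got {}".format(RECORD_BYTES, len(record)))
    label = record[0]
    pixels = record[LABEL_BYTES:]
    plane = HEIGHT * WIDTH
    # The value for (row, col, ch) lives at ch * plane + row * WIDTH + col.
    image = [
        [[pixels[ch * plane + row * WIDTH + col] for ch in range(CHANNELS)]
         for col in range(WIDTH)]
        for row in range(HEIGHT)
    ]
    return label, image


# Synthetic record: label 7, then constant red, green and blue planes.
record = bytes([7]) + bytes([10]) * 1024 + bytes([20]) * 1024 + bytes([30]) * 1024
label, image = parse_record(record)
print(label, image[0][0], len(image), len(image[0]))
```

Every pixel of the synthetic image comes out as [10, 20, 30], which confirms that the three planes are interleaved into one RGB triple per pixel.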

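8. Reading and writing TFRecords files

8.1. TFRecords: overview, storage and reading

TFRecords is a built-in file format designed for TensorFlow. It is a binary format that makes better use of memory and is easier to copy and move.
It stores the binary data and the labels (the class labels used for training) in the same file.
File extension: *.tfrecords
Content written to the file: Example protocol blocks

8.2. Writing TFRecords

8.2.1. Create a TFRecord writer

tf.python_io.TFRecordWriter(path): writes a tfrecords file
- path: the path of the TFRecords file
- return: a file writer
Methods
- write(record): writes one string record to the file
- close(): closes the file writer

Note: the string is a serialized Example, obtained with Example.SerializeToString()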

8.2.2. Build the Example protocol block for each sample

tf.train.Example(features=None)
- The record written to the tfrecords file
- features: a feature instance of type tf.train.Features
- return: an Example protocol block
tf.train.Features(feature=None)
- Builds the key-value pairs that describe each sample
- feature: a dict whose keys are the names to store under and whose values are tf.train.Feature instances
- return: a Features instance
tf.train.Feature(**options)
- **options: for example
  - bytes_list=tf.train.BytesList(value=[Bytes])
  - int64_list=tf.train.Int64List(value=[Value])
The available list types:
- tf.train.Int64List(value=[Value])
- tf.train.BytesList(value=[Bytes])
- tf.train.FloatList(value=[Value])

8.2.3. Code

import os
import tensorflow as tf

class CifarRead(object):
    """Read the binary files, write them to tfrecords, and read the tfrecords back.
    """
    def __init__(self, file_list):
        self.file_list = file_list

        # Properties of the images being read.
        self.height = 32
        self.width = 32
        self.channel = 3

        # Bytes per sample in the binary file: 1 label byte plus the image bytes.
        self.label_bytes = 1
        self.image_bytes = self.height * self.width * self.channel
        self.bytes = self.label_bytes + self.image_bytes

    def read_and_decode(self):
        # 1. Build the file queue.
        file_queue = tf.train.string_input_producer(self.file_list)
        # 2. Build the binary reader; each read returns one sample of self.bytes bytes.
        reader = tf.FixedLengthRecordReader(self.bytes)
        key, value = reader.read(file_queue)
        # 3. Decode the binary content.
        label_image = tf.decode_raw(value, tf.uint8)

        # 4. Split the label from the image: slice out the target value and the features.
        label = tf.cast(tf.slice(label_image, [0], [self.label_bytes]), tf.int32)

        image = tf.slice(label_image, [self.label_bytes], [self.image_bytes])

        # 5. Reshape the image. CIFAR-10 stores each image one channel plane at a time,
        # so go [3072] --> [3, 32, 32] and then transpose to [32, 32, 3].
        image_chw = tf.reshape(image, [self.channel, self.height, self.width])
        image_reshape = tf.transpose(image_chw, [1, 2, 0])

        # 6. Batch the data.
        image_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=10, num_threads=1, capacity=10)

        print(image_batch, label_batch)

        return image_batch, label_batch

    def write_to_tfrecords(self, image_batch, label_batch):
        """Store the image features and target values in a tfrecords file.
        """
        # Create the tfrecords writer.
        writer = tf.python_io.TFRecordWriter("./tmp/cifar-10.tfrecords")
        # Evaluate one batch in a single run so each image stays paired with its
        # label; this method must be called inside a `with tf.Session()` block.
        images, labels = tf.get_default_session().run([image_batch, label_batch])
        # Write every sample in the batch; each image needs its own Example.
        for i in range(len(images)):
            # Take the features and target value of the i-th image.
            image = images[i].tobytes()
            label = int(labels[i][0])

            # Build the Example for this sample.
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
            }))

            # Write the single sample.
            writer.write(example.SerializeToString())
        writer.close()

    
if __name__ == "__main__":
    # Find the files and put them in a list.
    file_names = os.listdir('./cifar-10_data')
    file_list = [os.path.join('./cifar-10_data/', f) for f in file_names if f[0:4] == 'data']

    cf = CifarRead(file_list)

    image_batch, label_batch = cf.read_and_decode()

    # Open a session and run.
    with tf.Session() as sess:
        # Create a thread coordinator.
        coord = tf.train.Coordinator()

        # Start the file-reading threads.
        threads = tf.train.start_queue_runners(sess, coord=coord)

        # Write to the tfrecords file.
        print("Start writing")
        cf.write_to_tfrecords(image_batch, label_batch)
        print("Finished writing")

        # Print the content read.
        print(sess.run([image_batch, label_batch]))

        # Stop and reclaim the child threads.
        coord.request_stop()
        coord.join(threads)

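9. Reading TFRecords

9.1. How to read TFRecords

The workflow is the same as for the file readers, with an additional parsing step in the middle.
Parse the Example protocol block stored in the TFRecords file:
tf.parse_single_example(serialized, features=None, name=None)
- Parses a single Example proto
- serialized: a scalar string Tensor holding one serialized Example
- features: a dict whose keys are the names to read and whose values are FixedLenFeature instances
- return: a dict of key-value pairs whose keys are the names read
tf.FixedLenFeature(shape, dtype)
- shape: the shape of the input data; usually left unspecified as an empty list
- dtype: the type of the input data, which must match the type stored in the file
- The type can only be float32, int64, or string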
9.2. Code

import os
import tensorflow as tf

# Define the CIFAR command-line flags.
FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string("cifar_dir", "./cifar-10_data/", "Directory of the CIFAR files")
tf.app.flags.DEFINE_string("cifar_tfrecords", "./tmp/cifar-10.tfrecords", "Path of the tfrecords file")


class CifarRead(object):
    """Read the binary files, write them to tfrecords, and read the tfrecords back.
    """
    def __init__(self, file_list):
        self.file_list = file_list

        # Properties of the images being read.
        self.height = 32
        self.width = 32
        self.channel = 3

        # Bytes per sample in the binary file: 1 label byte plus the image bytes.
        self.label_bytes = 1
        self.image_bytes = self.height * self.width * self.channel
        self.bytes = self.label_bytes + self.image_bytes

    def read_and_decode(self):
        # 1. Build the file queue.
        file_queue = tf.train.string_input_producer(self.file_list)
        # 2. Build the binary reader; each read returns one sample of self.bytes bytes.
        reader = tf.FixedLengthRecordReader(self.bytes)
        key, value = reader.read(file_queue)
        # 3. Decode the binary content.
        label_image = tf.decode_raw(value, tf.uint8)

        # 4. Split the label from the image: slice out the target value and the features.
        label = tf.cast(tf.slice(label_image, [0], [self.label_bytes]), tf.int32)

        image = tf.slice(label_image, [self.label_bytes], [self.image_bytes])

        # 5. Reshape the image. CIFAR-10 stores each image one channel plane at a time,
        # so go [3072] --> [3, 32, 32] and then transpose to [32, 32, 3].
        image_chw = tf.reshape(image, [self.channel, self.height, self.width])
        image_reshape = tf.transpose(image_chw, [1, 2, 0])

        # 6. Batch the data.
        image_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=10, num_threads=1, capacity=10)

        print(image_batch, label_batch)

        return image_batch, label_batch

    def write_to_tfrecords(self, image_batch, label_batch):
        """Store the image features and target values in a tfrecords file.
        """
        # Create the tfrecords writer.
        writer = tf.python_io.TFRecordWriter(FLAGS.cifar_tfrecords)
        # Evaluate one batch in a single run so each image stays paired with its
        # label; this method must be called inside a `with tf.Session()` block.
        images, labels = tf.get_default_session().run([image_batch, label_batch])
        # Write every sample in the batch; each image needs its own Example.
        for i in range(len(images)):
            # Take the features and target value of the i-th image.
            image = images[i].tobytes()
            label = int(labels[i][0])

            # Build the Example for this sample.
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
            }))

            # Write the single sample.
            writer.write(example.SerializeToString())
        writer.close()
        
    def read_from_tfrecords(self):
        # Build the file queue.
        file_queue = tf.train.string_input_producer([FLAGS.cifar_tfrecords])
        # Build the file reader; each value read is one serialized Example.
        reader = tf.TFRecordReader()
        key, value = reader.read(file_queue)
        # Parse the Example.
        features = tf.parse_single_example(value, features={
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })

        # Decode the content: features stored as string must be decoded,
        # while int64 and float32 features need no decoding.
        image = tf.decode_raw(features['image'], tf.uint8)
        label = tf.cast(features['label'], tf.int32)

        # Fix the image shape so that it can be batched.
        image_reshape = tf.reshape(image, [self.height, self.width, self.channel])

        image_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=10, num_threads=1, capacity=10)

        print(image_batch, label_batch)

        return image_batch, label_batch
    
if __name__ == "__main__":
    # Find the files and put them in a list.
    file_names = os.listdir(FLAGS.cifar_dir)
    file_list = [os.path.join(FLAGS.cifar_dir, f) for f in file_names if f[0:4] == 'data']

    cf = CifarRead(file_list)

    # Read from the tfrecords file, which must already have been written.
    image_batch, label_batch = cf.read_from_tfrecords()

    # Open a session and run.
    with tf.Session() as sess:
        # Create a thread coordinator.
        coord = tf.train.Coordinator()

        # Start the file-reading threads.
        threads = tf.train.start_queue_runners(sess, coord=coord)

        print(sess.run([image_batch, label_batch]))

        # Stop and reclaim the child threads.
        coord.request_stop()
        coord.join(threads)
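
The decoding step applied to the string feature reinterprets raw bytes as fixed-width numbers. A minimal sketch of that idea with the standard `struct` module follows; the function below is a hypothetical plain-Python analogue, and its `fmt_char` argument (a struct format character such as "B" for uint8 or "H" for uint16) is an assumption standing in for out_type.

```python
import struct


def decode_raw(data, fmt_char="B", little_endian=True):
    """Reinterpret a bytes object as a list of fixed-width numbers."""
    byte_order = "<" if little_endian else ">"
    item_size = struct.calcsize(byte_order + fmt_char)
    if len(data) % item_size != 0:
        raise ValueError("data length is not a multiple of the item size")
    count = len(data) // item_size
    return list(struct.unpack("{}{}{}".format(byte_order, count, fmt_char), data))


raw = bytes([1, 0, 2, 0])
as_uint8 = decode_raw(raw, "B")
as_uint16_le = decode_raw(raw, "H", little_endian=True)
as_uint16_be = decode_raw(raw, "H", little_endian=False)
print(as_uint8, as_uint16_le, as_uint16_be)
```

The same four bytes give four values as uint8 but only two as uint16, so the type used for decoding has to match the type the bytes were written with, as with the uint8 images stored above.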
