Three Ways to Do Data I/O in TensorFlow

thinker_1120

Category: Keras and Tensorflow. Tags: tensorflow

Article link: https://blog.csdn.net/cymy001/article/details/78715570

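(1) Embed the data directly in the graph, so the graph carries it into the session at run time -> tf.constant
(2) Put a placeholder where the data goes and fill it in at run time through feed_dict -> the graph program can be copied to different machines while each machine reads its data locally (a session can execute only one graph, but one graph can be handed to several sessions)
(3) Pipeline: use queues to do the data I/O asynchronously (producer-consumer pattern) -> pick a reader:
tf.TextLineReader() reads one line at a time
tf.WholeFileReader() reads one whole file at a time
tf.TFRecordReader() reads one record (a protobuf Example) at a time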

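Producer-consumer pattern: there are two working modules, module 1 and module 2. Module 1 stores its output in some space A. Module 2 never talks to module 1 directly; it simply takes its input from space A (as long as space A holds data, module 2 reads from it). This decouples the two modules to some degree and makes the pipeline asynchronous: neither module waits on the other unless the buffer is empty or full.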
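The pattern can be sketched without TensorFlow at all. The snippet below is only an illustration using Python's standard `queue` and `threading` modules; the function name `run_pipeline` and the `capacity` parameter are made up for this example, and the bounded `queue.Queue` plays the role of "space A".

```python
import queue
import threading


def run_pipeline(items, capacity=2):
    """Move items from a producer thread to a consumer thread via a bounded queue."""
    buffer = queue.Queue(maxsize=capacity)  # "space A" between the two modules
    done = object()                         # sentinel telling the consumer to stop
    consumed = []

    def producer():
        for item in items:
            buffer.put(item)   # blocks only while the buffer is full
        buffer.put(done)

    def consumer():
        while True:
            item = buffer.get()  # blocks only while the buffer is empty
            if item is done:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


if __name__ == "__main__":
    print(run_pipeline(["A1", "A2", "A3", "B1", "B2"]))
```

With one producer and one consumer the FIFO queue preserves order; the two threads only ever meet at the buffer.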

# (1)
import tensorflow as tf
x=tf.constant([1,2,3],name='x')
y=tf.constant([2,3,4],name='y')
z=tf.add(x,y,name='z')
with tf.Session() as sess:
    print(sess.run(z))
# [3 5 7]

# (2)
import tensorflow as tf
x=tf.placeholder(tf.int16)
y=tf.placeholder(tf.int16)
z=tf.add(x,y,name='z')
with tf.Session() as sess:
    xs=[1,2,3]
    ys=[2,3,4]
    print(sess.run(z,feed_dict={x:xs,y:ys}))
# [3 5 7]
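To illustrate the remark that one graph can be handed to several sessions, here is a small sketch. It assumes a TensorFlow version that ships `tf.compat.v1` (late 1.x or 2.x); the helper names `build_add_graph` and `run_in_new_session` are hypothetical and exist only for this example.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x-style graph API, also reachable from TF 2.x


def build_add_graph():
    """Build a graph with two placeholders and their sum."""
    graph = tf.Graph()
    with graph.as_default():
        x = tf1.placeholder(tf.int16, shape=[None], name="x")
        y = tf1.placeholder(tf.int16, shape=[None], name="y")
        z = tf.add(x, y, name="z")
    return graph, x, y, z


def run_in_new_session(graph, x, y, z, xs, ys):
    """Open a fresh session on the given graph and feed it local data."""
    with tf1.Session(graph=graph) as sess:
        return sess.run(z, feed_dict={x: xs, y: ys}).tolist()


if __name__ == "__main__":
    graph, x, y, z = build_add_graph()
    # Two independent sessions execute the same graph with different data.
    print(run_in_new_session(graph, x, y, z, [1, 2, 3], [2, 3, 4]))
    print(run_in_new_session(graph, x, y, z, [10, 20], [1, 2]))
```

Each session is bound to exactly one graph (the `graph=` argument), while the graph itself is reused unchanged.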

# (3)
import tensorflow as tf
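# Define the list of files
filenames = tf.train.match_filenames_once('./data/*.csv')
# filenames holds every file that matches the pattern; a plain list such as filenames=['A.csv', 'B.csv'] works too

# Define a queue of filenames
filename_queue = tf.train.string_input_producer(filenames, shuffle=False, num_epochs=3)   
# shuffle: whether to shuffle the file order; num_epochs: how many times each file in the set is reused

# Define a reader
reader=tf.TextLineReader()
_,value=reader.read(filename_queue)   # feed the queue to the reader; it returns key and value, where value is the content of the line just read

#【3.a】
example,label = tf.decode_csv(value,record_defaults=[['null'],['null']])   # parse the line; the CSV files have only two columns


init_op = tf.local_variables_initializer()
with tf.Session() as sess:   # the session runs in the main thread; the queues are filled by separate threads
    sess.run(init_op)
    coord=tf.train.Coordinator()   # coordinator that manages the threads
    threads=tf.train.start_queue_runners(coord=coord)   # start the QueueRunners
    # locally a queue runner is a thread; in a distributed system on a different machine it is a separate process from the receiving machine's point of view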
    for _ in range(5):
        print(sess.run([example,label]))
    
    coord.request_stop()
    coord.join(threads) 

# [b'A1', b'a1']
# [b'A2', b'a2']
# [b'A3', b'a3']
# [b'B1', b'b1']
# [b'B2', b'b2']

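While the session runs, the model keeps pulling data from the in-memory queue. If that queue is empty and has been closed because the input is exhausted, the run fails with tf.errors.OutOfRangeError.
The filename queue is filled separately: the queue runner created by string_input_producer keeps adding files to it according to the filenames, num_epochs and shuffle settings defined earlier.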
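A common way to deal with the out-of-range error is to read in a loop and treat the error as the end-of-data signal. The sketch below assumes a TensorFlow build that provides `tf.compat.v1`; the demo files and the helper names `write_demo_csvs` and `read_until_exhausted` are invented for illustration.

```python
import os
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x graph/queue API, also reachable from TF 2.x


def write_demo_csvs(directory):
    """Create two tiny two-column CSV files (hypothetical demo data)."""
    contents = {"A.csv": ["A1,a1", "A2,a2"], "B.csv": ["B1,b1"]}
    paths = []
    for name in sorted(contents):
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write("\n".join(contents[name]) + "\n")
        paths.append(path)
    return paths


def read_until_exhausted(paths, num_epochs=1):
    """Read lines until the filename queue is closed and empty."""
    lines = []
    with tf.Graph().as_default():
        filename_queue = tf1.train.string_input_producer(
            paths, shuffle=False, num_epochs=num_epochs)
        _, value = tf1.TextLineReader().read(filename_queue)
        # num_epochs is tracked in a local variable, so it must be initialized.
        init_op = tf1.local_variables_initializer()
        with tf1.Session() as sess:
            sess.run(init_op)
            coord = tf1.train.Coordinator()
            threads = tf1.train.start_queue_runners(sess=sess, coord=coord)
            try:
                while not coord.should_stop():
                    lines.append(sess.run(value))
            except tf.errors.OutOfRangeError:
                pass  # every file has been read num_epochs times
            finally:
                coord.request_stop()
                coord.join(threads)
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(read_until_exhausted(write_demo_csvs(tmp), num_epochs=2))
```

Compared with a fixed `range(5)` loop, this reads exactly as many records as the files and `num_epochs` provide, then shuts the threads down cleanly.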

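#【3.b】
batch_size = 5
min_after_dequeue = 10   # illustrative value; tf.train.shuffle_batch requires this argument
example_batch, label_batch = tf.train.shuffle_batch([example, label], batch_size=batch_size, capacity=min_after_dequeue + 3 * batch_size, min_after_dequeue=min_after_dequeue)
# Records are handed out batch_size at a time; capacity is the size of the in-memory queue
# min_after_dequeue is the minimum number of records kept in the queue after a dequeue, so that enough records remain to shuffle among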
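To get a feel for what min_after_dequeue buys, the following pure-Python sketch simulates a shuffle queue in its worst case, where the queue never holds more than min_after_dequeue + 1 records before a dequeue. It is a simplified single-threaded model, not TensorFlow's implementation, and the function name `worst_case_shuffle` is made up for this example.

```python
import random


def worst_case_shuffle(records, min_after_dequeue, seed=0):
    """Dequeue order when the queue holds only min_after_dequeue + 1 records."""
    rng = random.Random(seed)
    source = iter(records)
    buffer, output = [], []
    exhausted = False
    while True:
        # Producer side: top the queue up to the minimum level plus one.
        while not exhausted and len(buffer) < min_after_dequeue + 1:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True  # queue closed: leftovers may be drained
        if not buffer:
            return output
        # Consumer side: take one record at random from what is buffered.
        output.append(buffer.pop(rng.randrange(len(buffer))))


if __name__ == "__main__":
    print(worst_case_shuffle(list(range(10)), min_after_dequeue=0))
    print(worst_case_shuffle(list(range(10)), min_after_dequeue=4))
```

With `min_after_dequeue=0` the records come out in their original order, because there is never more than one candidate to pick from. Larger values widen the window each pick is drawn from, at the cost of more memory and a longer warm-up.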

#【3.c】
# Two readers pull from the same filename queue, one per list entry
record_list = [tf.decode_csv(tf.TextLineReader().read(filename_queue)[1],record_defaults=[['null'],['null']]) for _ in range(2)]
example_batch, label_batch = tf.train.batch_join(record_list, batch_size=5)

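With the code in 【3.a】, the records in the in-memory queue stay in order, because each record is decoded and put into memory as soon as it arrives.
Alternatively, several records can be read first and then put into memory together, for example one batch at a time, still with a single reader 【3.b】.
Several readers can also read in parallel 【3.c】.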

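Introduction to TFRecord

TFRecord is TensorFlow's unified data input format. It is built on protobuf, which makes data transfer more efficient.
1. It gives different kinds of input files one common framework.
2. It saves space (TFRecord files are compact binary files, serialized with protocol buffers).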
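As a quick illustration of the protobuf layer, the sketch below builds one `tf.train.Example`, serializes it to bytes, and parses the bytes back. The feature names `image_raw` and `label` and the helper names `make_example` and `roundtrip` are choices made for this example only.

```python
import tensorflow as tf


def make_example(image_bytes, label):
    """Wrap raw bytes and an integer label in a tf.train.Example message."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image_raw": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))


def roundtrip(image_bytes, label):
    """Serialize an Example to bytes and parse it back."""
    serialized = make_example(image_bytes, label).SerializeToString()
    restored = tf.train.Example.FromString(serialized)
    feature = restored.features.feature
    return (feature["image_raw"].bytes_list.value[0],
            feature["label"].int64_list.value[0])


if __name__ == "__main__":
    print(roundtrip(b"\x00\x01\x02", 7))
```

Whatever the original file type was, every record ends up as the same kind of serialized message, which is what lets one reader handle them all.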

Data preparation workflow

(1) Convert the raw_data in the CSV file to TFRecord, row by row
(2) Read the TFRecord data back, using the queue mechanism

## convert csv files to tfrecord
import tensorflow as tf
import numpy as np
import pandas as pd

train_frame = pd.read_csv("train.csv")
print(train_frame.head())
train_labels_frame = train_frame.pop(item="label")
train_values = train_frame.values
train_labels = train_labels_frame.values
print("values shape: ", train_values.shape)
print("labels shape:", train_labels.shape)

writer = tf.python_io.TFRecordWriter("csv_train.tfrecords")

for i in range(train_values.shape[0]):
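    image_raw = train_values[i].astype(np.uint8).tobytes()   # assumes values fit in 0..255; uint8 matches the tf.uint8 used when decoding
    example = tf.train.Example(   # built to match the protobuf Example structure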
        features=tf.train.Features(
            feature={
                "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_raw])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[train_labels[i]]))
            }
        )
    )
    writer.write(record=example.SerializeToString())
    
writer.close()
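The bytes stored in image_raw carry no dtype information, so whatever dtype is used to decode them has to match the dtype of the array that was serialized. A small NumPy sketch of that point (the row values and the helper names `encode_row` and `decode_row` are made up for illustration):

```python
import numpy as np


def encode_row(row, dtype):
    """Serialize a 1-D array to raw bytes using the given dtype."""
    return np.asarray(row).astype(dtype).tobytes()


def decode_row(raw, dtype):
    """Interpret raw bytes as a 1-D array of the given dtype."""
    return np.frombuffer(raw, dtype=dtype)


if __name__ == "__main__":
    row = np.array([0, 128, 255], dtype=np.int64)  # integer CSV columns load as int64

    matched = decode_row(encode_row(row, np.uint8), np.uint8)
    mismatched = decode_row(encode_row(row, np.int64), np.uint8)

    print(matched.shape)     # one element per original value
    print(mismatched.shape)  # eight bytes per original value
```

Writing as int64 and decoding as uint8 yields eight times as many elements as the original row, which is why the write side and the read side need to agree on one dtype.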

import tensorflow as tf

reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(["csv_train.tfrecords"])

_, serialized_record = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized_record,
    features={
        ## tf.FixedLenFeature return Tensor
        ## tf.VarLenFeature return SparseTensor
        "image_raw": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })

images = tf.decode_raw(features["image_raw"], tf.uint8)
labels = tf.cast(features["label"], tf.int32)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    
    for i in range(10):
        image, label = sess.run([images, labels])