4. Reading Data in TensorFlow

_信念_

Category: TensorFlow. Tags: TensorFlow, data loading

Article link: https://blog.csdn.net/chen280085871/article/details/81269125

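1. Ways to read data

Feeding: Python code supplies the data at every step of the running TensorFlow program.

Reading from files: an input pipeline at the start of the TensorFlow graph reads the data from files.

Preloaded data: a constant or variable in the TensorFlow graph holds all of the data; this only suits small datasets.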

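2. Implementation

tf.placeholder() and the feeding mechanism are designed to be used together: the sole purpose of a placeholder is to act as the target through which data is fed.

2.1 Reading from files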

import tensorflow as tf  # TensorFlow 1.x queue-based API; in TensorFlow 2.x these names live under tf.compat.v1

# Build the filename queue: a FIFO queue plus the QueueRunner that fills it
filenames = ['D.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Define the reader: each read() returns one line of the file
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
# Define the decoder: integer defaults make every column parse as an integer
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3])  # the first 3 columns are the features
label = tf.stack([col4, col5])           # the last 2 columns are the labels
example_batch, label_batch = tf.train.shuffle_batch(
    [features, label], batch_size=1, capacity=200,
    min_after_dequeue=100, num_threads=2)
# Run the graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()  # coordinator that manages the threads
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)

The example above must include both the step that starts the threads and the step that stops them at the end.
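To make the role of record_defaults more concrete, here is a minimal pure-Python sketch of what the decoding step does to a single line. It is only an illustration: decode_csv_line is a hypothetical helper written for this article, not a TensorFlow function, and it assumes that an empty field takes the column default and that the type of the default decides the type of the column.

```python
import csv

def decode_csv_line(line, record_defaults):
    """Hypothetical stand-in for the decode step: one CSV line -> typed columns."""
    fields = next(csv.reader([line]))
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d columns, got %d"
                         % (len(record_defaults), len(fields)))
    columns = []
    for text, default in zip(fields, record_defaults):
        cast = type(default[0])  # the type of the default fixes the column type
        columns.append(cast(text) if text != "" else default[0])
    return columns

record_defaults = [[1], [1], [1], [1], [1]]  # five integer columns, default 1
cols = decode_csv_line("7,8,,0,1", record_defaults)
features, label = cols[:3], cols[3:]  # first 3 columns: features, last 2: labels
print(features, label)
```

The third field is empty in the sample line, so it falls back to the default value 1.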

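tf.train.string_input_producer creates a QueueRunner, whose thread fills the filename queue. Likewise, tf.train.shuffle_batch creates a QueueRunner whose threads (num_threads=2 in the example) run the reader and tf.decode_csv and fill the example queue. The two stages do not block each other, and the Coordinator created in the example is what manages these QueueRunner threads.

When data is read from files and then assembled into batches (which can in turn be passed in through feed_dict), you need to state explicitly how many times the files are to be read, using the num_epochs argument of tf.train.string_input_producer. The filename queue and its thread come first in the pipeline.

Note: before calling run() or eval(), the filenames must actually be put into the queue, so the call to tf.train.start_queue_runners(coord=coord) is mandatory. Without it the read blocks forever, waiting for filenames that never arrive.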
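The interplay of the filename queue, the reader thread and the coordinator can be sketched with the Python standard library alone. This is an analogy under stated assumptions, not TensorFlow code: the in-memory FILES dictionary, the DONE sentinel and both thread functions are hypothetical names invented for the illustration, and a threading.Event stands in for the Coordinator.

```python
import queue
import threading

filename_queue = queue.Queue()
example_queue = queue.Queue(maxsize=4)  # bounded, similar in spirit to `capacity`
stop_event = threading.Event()          # plays the role of the Coordinator
DONE = object()                         # sentinel meaning "this queue is closed"

# Hypothetical in-memory "files" so the sketch runs without touching the disk.
FILES = {"a.csv": ["1,2", "3,4"], "b.csv": ["5,6"]}

def enqueue_filenames(names):
    """First stage: put the filenames into the filename queue."""
    for name in names:
        filename_queue.put(name)
    filename_queue.put(DONE)

def read_examples():
    """Second stage: take a filename, read its lines, enqueue the examples."""
    while not stop_event.is_set():
        name = filename_queue.get()
        if name is DONE:
            break
        for line in FILES[name]:
            example_queue.put(line)
    example_queue.put(DONE)

threads = [
    threading.Thread(target=enqueue_filenames, args=(["a.csv", "b.csv"],)),
    threading.Thread(target=read_examples),
]
for t in threads:
    t.start()  # if the threads were never started, the get() below would block forever

results = []
while True:
    item = example_queue.get()
    if item is DONE:
        break
    results.append(item)

stop_event.set()  # counterpart of coord.request_stop()
for t in threads:
    t.join()      # counterpart of coord.join(threads)
print(results)
```

The consumer loop corresponds to the sess.run calls in the example: it can only make progress because the two producer threads were started beforehand.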

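2.2 The TFRecords data format

TensorFlow has its own dedicated data format, TFRecords. Any kind of data can first be converted into this format and then read back out of it when needed; the MNIST dataset is handled this way.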

3. Preprocessing

Normalization, denoising and similar processing.
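As a rough illustration of these two kinds of preprocessing, the sketch below uses min-max scaling for normalization and a moving average for denoising. Both choices, and the function names, are assumptions made for the example; the article does not say which methods are meant.

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def moving_average(values, window=3):
    """Smooth values by averaging each one with its neighbours.

    The window is clipped at both ends of the sequence.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

print(min_max_normalize([2, 4, 6]))
print(moving_average([1, 1, 10, 1, 1]))
```

In the file-reading pipeline, a step of this kind would take the place of some_processing in section 4.1.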

4. Batching

4.1

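def read_my_file_format(filename_queue):
  # tf.SomeReader, tf.some_decoder and some_processing are placeholders:
  # substitute the reader, decoder and preprocessing for your own file format.
  reader = tf.SomeReader()
  key, record_string = reader.read(filename_queue)
  example, label = tf.some_decoder(record_string)
  processed_example = some_processing(example)
  return processed_example, label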

def input_pipeline(filenames, batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)
  example, label = read_my_file_format(filename_queue)
  # min_after_dequeue defines how big a buffer we will randomly sample
  #   from -- bigger means better shuffling but slower start up and more
  #   memory used.
  # capacity must be larger than min_after_dequeue and the amount larger
  #   determines the maximum we will prefetch.  Recommendation:
  #   min_after_dequeue + (num_threads + a small safety margin) * batch_size
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch

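This code reads the data from files and splits it into batches for training. Pay particular attention to how capacity and the related parameters are set.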
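The recommendation in the code comments can be written out as a small calculation. This is only a sketch: recommended_capacity and its safety_margin parameter are hypothetical names, and the default margin of 2 is chosen so that one thread reproduces the 3 * batch_size term used in the code.

```python
def recommended_capacity(min_after_dequeue, batch_size,
                         num_threads=1, safety_margin=2):
    """capacity = min_after_dequeue + (num_threads + safety margin) * batch_size"""
    return min_after_dequeue + (num_threads + safety_margin) * batch_size

# One reading thread: identical to min_after_dequeue + 3 * batch_size.
print(recommended_capacity(10000, 128))                 # 10384
# Four reading threads need room for more in-flight batches.
print(recommended_capacity(10000, 128, num_threads=4))  # 10768
```

The result is always larger than min_after_dequeue, which is the hard requirement; the extra room bounds how much is prefetched.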

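4.2

To shuffle examples from different files more thoroughly and to read the files in parallel, the batches can be built with tf.train.shuffle_batch_join() instead. The reading code changes slightly as a result.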

# read_my_file_format(filename_queue) is the same as in 4.1

def input_pipeline(filenames, batch_size, read_threads, num_epochs=None):
  filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)
  example_list = [read_my_file_format(filename_queue)
                  for _ in range(read_threads)]
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch_join(
      example_list, batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch

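Reading the file contents now produces a list, with one reader per entry of example_list.

The benefit is that data can be read from several files at the same time, by multiple threads.

Alternatively, keep a single reader and set num_threads of tf.train.shuffle_batch() to a value greater than 1; at any moment the data is then read from only one file. Advantages of this approach:

It avoids two different threads reading the same sample from the same file.
It avoids an excessive number of disk seeks.

What is avoided here is different threads reading the same sample; several threads may still read different parts of the same file.

So how many threads are needed in total?
First, one thread is needed to put the filenames into the queue; then threads are needed to read the data and split it into batches. In both approaches the number of these threads can be set by the user (num_threads for shuffle_batch, and the length of example_list, read_threads in the code above, for shuffle_batch_join).

4.3

Once the filename queue has reached its maximum number of epochs, the corresponding thread shuts down.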

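# Create the graph, etc.
# The epoch counter created by num_epochs is a local variable, so
# initialize local variables as well as global ones.
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())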

# Create a session for running operations in the Graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever
        sess.run(train_op)

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()

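The other threads do not stop at that point. They keep running, because many of the other queues still contain elements that can be dequeued.

Any other kind of error, however, causes all threads to stop.

After the filename queue's thread has stopped, the reader threads stop next, followed by the threads of the example queue; in other words, the shutdown is communicated from one stage to the next. Once the example queue has been closed, the min_after_dequeue restriction no longer applies: the remaining elements are dequeued even though fewer than min_after_dequeue are buffered, and when the queue is empty the training threads receive an OutOfRangeError and exit. The capacity of the example queue is usually set to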

capacity = min_after_dequeue + 3 * batch_size
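The draining behaviour described in this section can be imitated with a simplified shuffle buffer in plain Python. This is a sketch under assumptions, not the TensorFlow implementation: shuffled_stream is a hypothetical generator, the end of the source iterable stands in for "the upstream queue was closed", and the seed parameter exists only to make the run repeatable.

```python
import random

def shuffled_stream(source, min_after_dequeue, seed=0):
    """Yield the items of source in random order from a shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        buffer.append(item)
        # Upstream still open: only dequeue while at least
        # min_after_dequeue items stay behind in the buffer.
        while len(buffer) > min_after_dequeue:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Upstream closed: the restriction is lifted and the buffer drains completely.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

out = list(shuffled_stream(range(10), min_after_dequeue=4))
print(out)
```

Every input item comes out exactly once, including the last four, which would stay stuck in the buffer if the minimum were still enforced after the upstream had finished.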

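4.4 Preloading data

This applies only when the dataset is small: all of the data can be loaded directly. The tf.train.slice_input_producer() function can then put this in-memory data into a queue, after which it is split into batches and used for batch training.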
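The idea of slicing preloaded data into shuffled batches can be sketched without TensorFlow. The generator below is a simplified stand-in written for this article: slice_batches is a hypothetical name, it reshuffles once per epoch, and it lets the last batch of an epoch be smaller, which is an assumption of the sketch rather than a statement about how the TensorFlow queues behave.

```python
import random

def slice_batches(features, labels, batch_size, num_epochs, seed=0):
    """Yield (feature_batch, label_batch) pairs from data held fully in memory."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    rng = random.Random(seed)
    indices = list(range(len(features)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # new order every epoch
        for start in range(0, len(indices), batch_size):
            chosen = indices[start:start + batch_size]
            # The same indices are used for both lists, so pairs stay aligned.
            yield [features[i] for i in chosen], [labels[i] for i in chosen]

features = [[i, i + 1] for i in range(5)]  # all data preloaded as Python lists
labels = list(range(5))
batches = list(slice_batches(features, labels, batch_size=2, num_epochs=2))
for feature_batch, label_batch in batches:
    print(feature_batch, label_batch)
```

With 5 samples and a batch size of 2, each epoch yields three batches, the last of which holds a single sample.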

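5. Multiple input pipelines

This covers the case where you want to evaluate the model while it is still being trained. Training and evaluation then need to share the current model. Because training updates the model's parameter values automatically, evaluation on the shared model always sees the current parameters, which makes this kind of eval easy to set up.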
