TensorFlow custom op: How to prefetch data using a custom Python function in TensorFlow

Latest recommended article published on 2022-03-31 00:45:08

MJ勺子

Article tags: TensorFlow custom op

Article link: https://blog.csdn.net/weixin_36228334/article/details/113021065

I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses the data (e.g. by adding a context window). In other words, one thread does data preprocessing and the other does training. Is this possible in TensorFlow?
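
To make the preprocessing step concrete, here is a minimal sketch of what "adding a context window" could look like, assuming it means concatenating each frame with its neighbouring frames. The function name add_context_window, the list-of-lists frame layout, and the choice to pad the edges by repeating the first and last frame are all assumptions made for illustration; they do not come from the question.

```python
def add_context_window(frames, left=1, right=1):
    """Concatenate each frame with `left` previous and `right` following frames.

    `frames` is a list of equal-length lists of floats. Edges are padded by
    repeating the first/last frame (an assumed convention).
    """
    if not frames:
        return []
    last = len(frames) - 1
    windowed = []
    for i in range(len(frames)):
        row = []
        for j in range(i - left, i + right + 1):
            # Clamp the index so out-of-range neighbours reuse the edge frame.
            row.extend(frames[min(max(j, 0), last)])
        windowed.append(row)
    return windowed


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
windowed = add_context_window(frames)
```

Each output row is (left + 1 + right) times as long as an input frame, so a preprocessing thread running this kind of function would feed wider vectors into the training graph than the ones stored on disk.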

Update: I have a working example based on @mrry's example.

import threading

import numpy as np
import tensorflow as tf

BATCH_SIZE = 5
TRAINING_ITERS = 4100

# Placeholders for one example: a length-128 feature vector and a length-128 label vector.
feature_input = tf.placeholder(tf.float32, shape=[128])
label_input = tf.placeholder(tf.float32, shape=[128])

# The queue buffers up to 200 (label, feature) pairs.
q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
enqueue_op = q.enqueue([label_input, feature_input])
# Closing the queue cancels a pending enqueue, so the loader thread
# cannot stay blocked on a full queue once training has finished.
close_op = q.close(cancel_pending_enqueues=True)

label_batch, feature_batch = q.dequeue_many(BATCH_SIZE)
c = tf.reshape(feature_batch, [BATCH_SIZE, 128]) + tf.reshape(label_batch, [BATCH_SIZE, 128])

sess = tf.Session()

def load_and_enqueue(sess, enqueue_op, coord):
    # Binary mode is required because np.fromfile reads raw float32 bytes.
    with open('dummy_data/features.bin', 'rb') as feature_file, open('dummy_data/labels.bin', 'rb') as label_file:
        while not coord.should_stop():
            feature_array = np.fromfile(feature_file, np.float32, 128)
            if feature_array.shape[0] == 0:
                print('reach end of file, reset using seek(0,0)')
                feature_file.seek(0, 0)
                label_file.seek(0, 0)
                continue
            label_value = np.fromfile(label_file, np.float32, 128)
            try:
                sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                                label_input: label_value})
            except tf.errors.CancelledError:
                # The queue was closed during shutdown.
                break

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
t.start()

for i in range(TRAINING_ITERS):
    # Named batch_sum to avoid shadowing the built-in sum().
    batch_sum = sess.run(c)
    print('train_iter=' + str(i))
    print(batch_sum)

coord.request_stop()
sess.run(close_op)
coord.join([t])
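
The example reads from dummy_data/features.bin and dummy_data/labels.bin, which are not included in the post. Below is a minimal sketch for generating compatible files, assuming they are headerless streams of native-byte-order float32 values (the layout that np.fromfile(f, np.float32, 128) reads). The helper name write_dummy_file, the record count, and the values written are hypothetical, and a temporary directory stands in for dummy_data/.

```python
import os
import tempfile
from array import array

RECORD_LEN = 128   # floats per example, matching shape=[128] in the example
NUM_RECORDS = 10   # arbitrary size for this sketch


def write_dummy_file(path, num_records, record_len, offset=0.0):
    """Write num_records * record_len C floats, headerless, native byte order."""
    # Type code 'f' is a C float, which is 4 bytes (float32) on common platforms.
    values = array('f', (offset + i for i in range(num_records * record_len)))
    with open(path, 'wb') as f:
        values.tofile(f)


# A temporary directory stands in for dummy_data/ so the sketch
# does not touch the working directory.
data_dir = tempfile.mkdtemp()
feature_path = os.path.join(data_dir, 'features.bin')
label_path = os.path.join(data_dir, 'labels.bin')
write_dummy_file(feature_path, NUM_RECORDS, RECORD_LEN)
write_dummy_file(label_path, NUM_RECORDS, RECORD_LEN, offset=0.5)

# Read the first record back to confirm the layout.
with open(feature_path, 'rb') as f:
    first_record = array('f')
    first_record.fromfile(f, RECORD_LEN)
```

With 10 records per file, the loader thread reaches the end of the file after 10 enqueues and rewinds, so the same examples are replayed for the rest of the run.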

Solution

This is a common use case, and most implementations use TensorFlow's queues to decouple the preprocessing code from the training code. There is a tutorial on how to use queues, but the main steps are as follows:

1. Define a queue, q, that will buffer the preprocessed data. TensorFlow supports the simple tf.FIFOQueue that produces elements in the order they were enqueued, and the more advanced tf.RandomShuffleQueue that produces elements in a random order. A queue element is a tuple of one or more tensors (which can have different types and shapes). All queues support single-element (enqueue, dequeue) and batch (enqueue_many, dequeue_many) operations, but to use the batch operations you must specify the shapes of each tensor in a queue element when constructing the queue.

2. Build a subgraph that enqueues preprocessed elements into the queue. One way to do this would be to define some tf.placeholder() ops for tensors corresponding to a single input example, then pass them to q.enqueue(). (If your preprocessing produces a batch at once, you should use q.enqueue_many() instead.) You might also include TensorFlow ops in this subgraph.

3. Build a subgraph that performs training. This will look like a regular TensorFlow graph, but will get its input by calling q.dequeue_many(BATCH_SIZE).

4. Start your session.

5. Create one or more threads that execute your preprocessing logic, then execute the enqueue op, feeding in the preprocessed data. You may find the tf.train.Coordinator and tf.train.QueueRunner utility classes useful for this.

6. Run your training graph (optimizer, etc.) as normal.
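
The control flow of these steps can be sketched without TensorFlow, using only the Python standard library. This is an analogy rather than TensorFlow code: queue.Queue stands in for tf.FIFOQueue, a threading.Event plays the role of tf.train.Coordinator, and the preprocess and dequeue_many helpers are hypothetical names invented for this sketch.

```python
import queue
import threading

BATCH_SIZE = 5
NUM_BATCHES = 4

# Bounded buffer, standing in for tf.FIFOQueue(capacity, ...).
q = queue.Queue(maxsize=20)
# Stop flag, playing the role of tf.train.Coordinator.
stop_event = threading.Event()


def preprocess(i):
    """Hypothetical preprocessing: build a (feature, label) pair from an index."""
    return [float(i)] * 3, i % 2


def load_and_enqueue():
    i = 0
    while not stop_event.is_set():
        item = preprocess(i)
        while not stop_event.is_set():
            try:
                # Time out periodically so a full queue cannot block shutdown.
                q.put(item, timeout=0.1)
                break
            except queue.Full:
                continue
        i += 1


def dequeue_many(n):
    """Block until n elements are available and return them as a batch."""
    features, labels = [], []
    for _ in range(n):
        feature, label = q.get()
        features.append(feature)
        labels.append(label)
    return features, labels


t = threading.Thread(target=load_and_enqueue, daemon=True)
t.start()

# The "training loop": consume batches while the producer keeps the queue filled.
batches = [dequeue_many(BATCH_SIZE) for _ in range(NUM_BATCHES)]

stop_event.set()
t.join()
```

The bounded queue is what hides the I/O latency: the producer runs ahead of the consumer by up to 20 elements, and each side blocks only when the buffer is full or empty.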

EDIT: Here's a simple load_and_enqueue() function and code fragment to get you started:

# Features are length-100 vectors of floats.
feature_input = tf.placeholder(tf.float32, shape=[100])
# Labels are scalar integers.
label_input = tf.placeholder(tf.int32, shape=[])

# Alternatively, could do:
# feature_batch_input = tf.placeholder(tf.float32, shape=[None, 100])
# label_batch_input = tf.placeholder(tf.int32, shape=[None])

q = tf.FIFOQueue(100, [tf.float32, tf.int32], shapes=[[100], []])
enqueue_op = q.enqueue([feature_input, label_input])

# For batch input, do: