TensorFlow custom op: How to prefetch data in TensorFlow using a custom Python function

I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses the data (e.g. by adding a context window). In other words, one thread does data preprocessing and the other does training. Is this possible in TensorFlow?

Update: I have a working example based on @mrry's example.

import threading

import numpy as np
import tensorflow as tf

BATCH_SIZE = 5
TRAINING_ITERS = 4100

feature_input = tf.placeholder(tf.float32, shape=[128])
label_input = tf.placeholder(tf.float32, shape=[128])

q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
enqueue_op = q.enqueue([label_input, feature_input])

label_batch, feature_batch = q.dequeue_many(BATCH_SIZE)
c = tf.reshape(feature_batch, [BATCH_SIZE, 128]) + tf.reshape(label_batch, [BATCH_SIZE, 128])

sess = tf.Session()

def load_and_enqueue(sess, enqueue_op, coord):
    # Binary mode, since the files hold raw float32 values.
    with open('dummy_data/features.bin', 'rb') as feature_file, open('dummy_data/labels.bin', 'rb') as label_file:
        while not coord.should_stop():
            feature_array = np.fromfile(feature_file, np.float32, 128)
            if feature_array.shape[0] == 0:
                print('reach end of file, reset using seek(0,0)')
                feature_file.seek(0, 0)
                label_file.seek(0, 0)
                continue
            label_value = np.fromfile(label_file, np.float32, 128)
            sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                            label_input: label_value})

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
t.start()

for i in range(TRAINING_ITERS):
    result = sess.run(c)  # renamed from `sum` to avoid shadowing the builtin
    print('train_iter=' + str(i))
    print(result)

coord.request_stop()
coord.join([t])

Solution

This is a common use case, and most implementations use TensorFlow's queues to decouple the preprocessing code from the training code. There is a tutorial on how to use queues, but the main steps are as follows:

1. Define a queue, q, that will buffer the preprocessed data. TensorFlow supports the simple tf.FIFOQueue that produces elements in the order they were enqueued, and the more advanced tf.RandomShuffleQueue that produces elements in a random order. A queue element is a tuple of one or more tensors (which can have different types and shapes). All queues support single-element (enqueue, dequeue) and batch (enqueue_many, dequeue_many) operations, but to use the batch operations you must specify the shapes of each tensor in a queue element when constructing the queue.

2. Build a subgraph that enqueues preprocessed elements into the queue. One way to do this would be to define some tf.placeholder() ops for tensors corresponding to a single input example, then pass them to q.enqueue(). (If your preprocessing produces a batch at once, you should use q.enqueue_many() instead.) You might also include TensorFlow ops in this subgraph.

3. Build a subgraph that performs training. This will look like a regular TensorFlow graph, but will get its input by calling q.dequeue_many(BATCH_SIZE).

4. Start your session.

5. Create one or more threads that execute your preprocessing logic, then execute the enqueue op, feeding in the preprocessed data. You may find the tf.train.Coordinator and tf.train.QueueRunner utility classes useful for this.

6. Run your training graph (optimizer, etc.) as normal.
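The decoupling described in these steps can also be sketched without TensorFlow, using only Python's standard library. This illustrates the pattern rather than the TensorFlow API: `queue.Queue` stands in for `tf.FIFOQueue`, and the names `preprocess`, `producer`, and `dequeue_many` are hypothetical helpers invented for the example.

```python
import queue
import threading

BATCH_SIZE = 4
NUM_EXAMPLES = 12


def preprocess(i):
    # Hypothetical stand-in for loading and preprocessing one example.
    return (float(i), i % 2)  # (feature, label)


def producer(q):
    # Runs in a background thread. put() blocks while the buffer is full,
    # so the loader never runs more than `maxsize` examples ahead.
    for i in range(NUM_EXAMPLES):
        q.put(preprocess(i))


def dequeue_many(q, n):
    # Rough analogue of q.dequeue_many(n): block until n elements were read.
    return [q.get() for _ in range(n)]


buffer = queue.Queue(maxsize=8)  # bounded FIFO buffer
loader = threading.Thread(target=producer, args=(buffer,))
loader.start()

batches = []
for _ in range(NUM_EXAMPLES // BATCH_SIZE):
    # Stand-in for one training step that consumes a batch.
    batches.append(dequeue_many(buffer, BATCH_SIZE))

loader.join()
```

The bounded `maxsize` is the design choice that matters here: it lets the loader work ahead of the consumer while capping how much preprocessed data sits in memory.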

EDIT: Here's a simple load_and_enqueue() function and code fragment to get you started:

import threading

import numpy as np
import tensorflow as tf

# BATCH_SIZE and TRAINING_EPOCHS are constants you define for your own setup.

# Features are length-100 vectors of floats.
feature_input = tf.placeholder(tf.float32, shape=[100])
# Labels are scalar integers.
label_input = tf.placeholder(tf.int32, shape=[])

# Alternatively, could do:
# feature_batch_input = tf.placeholder(tf.float32, shape=[None, 100])
# label_batch_input = tf.placeholder(tf.int32, shape=[None])

q = tf.FIFOQueue(100, [tf.float32, tf.int32], shapes=[[100], []])
enqueue_op = q.enqueue([feature_input, label_input])

# For batch input, do:
# enqueue_op = q.enqueue_many([feature_batch_input, label_batch_input])

feature_batch, label_batch = q.dequeue_many(BATCH_SIZE)

# Build rest of model taking label_batch, feature_batch as input.
# [...]
train_op = ...

sess = tf.Session()

def load_and_enqueue():
    with open(...) as feature_file, open(...) as label_file:
        while True:
            feature_array = np.fromfile(feature_file, np.float32, 100)
            # An empty array means end of file. Test .size explicitly, because
            # the truth value of a multi-element array is ambiguous.
            if feature_array.size == 0:
                return
            # Read the label from the label file, not the feature file.
            label_value = np.fromfile(label_file, np.int32, 1)[0]
            sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                            label_input: label_value})

# Start a thread to enqueue data asynchronously, and hide I/O latency.
t = threading.Thread(target=load_and_enqueue)
t.start()

for _ in range(TRAINING_EPOCHS):
    sess.run(train_op)
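The fragment above does not show how the loader thread is stopped once training finishes. A minimal sketch of cooperative shutdown, using only the standard library rather than TensorFlow: `threading.Event` plays roughly the role of `tf.train.Coordinator` (`should_stop()`, `request_stop()`, `join()`), `queue.Queue` stands in for the TensorFlow queue, and `load_forever` is a hypothetical loader invented for the example.

```python
import queue
import threading

stop_requested = threading.Event()  # analogue of the coordinator's stop flag
buffer = queue.Queue(maxsize=4)


def load_forever(q, stop):
    i = 0
    while not stop.is_set():
        try:
            # A timeout keeps a full queue from blocking the thread forever
            # once the consumer has stopped reading.
            q.put(i, timeout=0.05)
            i += 1
        except queue.Full:
            continue  # re-check the stop flag, then retry the same element


loader = threading.Thread(target=load_forever, args=(buffer, stop_requested))
loader.start()

# Stand-in for the training loop: consume ten elements.
consumed = [buffer.get() for _ in range(10)]

stop_requested.set()       # analogue of coord.request_stop()
loader.join(timeout=5.0)   # analogue of coord.join([t])
stopped_cleanly = not loader.is_alive()
```

The point of the timeout is that a producer blocked on a full buffer cannot notice a stop request. It has to wake up periodically, or be cancelled from outside, before it can exit.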
