[TensorFlow] Using TensorFlow's queues and multithreading to read data and speed up model training


Framework summary

  • Queue types
FIFOQueue
  • Enqueue/dequeue operations
enqueue, dequeue (see the short sketch below)
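A minimal FIFOQueue round-trip helps make the enqueue/dequeue ops concrete. This sketch follows the classic increment example from the TF 1.x documentation; the initial values are arbitrary:

import tensorflow as tf

q = tf.FIFOQueue(capacity=3, dtypes=tf.float32)
init = q.enqueue_many(([0.1, 0.2, 0.3],))  # push three elements at once
x = q.dequeue()            # pop the front element
q_inc = q.enqueue(x + 1)   # push it back, incremented

with tf.Session() as sess:
    sess.run(init)
    for _ in range(3):
        sess.run(q_inc)    # queue now holds 1.1, 1.2, 1.3
    for _ in range(3):
        print(sess.run(x)) # dequeues and prints 1.1, 1.2, 1.3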
  • Coordinator: thread coordinator
    • should_stop(): returns True if the threads should stop
    • request_stop(<exception>): requests that the threads stop
    • join(<list of threads>): waits until the specified child threads have terminated (before the main thread continues)
    • Usage steps
      • First create a Coordinator object, then create some threads that use it
      • These threads typically run in a loop until should_stop() returns True
      • Any thread can decide when the computation should stop: it simply calls request_stop(), after which should_stop() returns True in all the other threads, and they all stop
# Thread body: loop until the Coordinator receives a stop request.
# If some condition becomes true, ask the Coordinator to stop the other threads.
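Only those two comments survive from the snippet this section originally showed; below is a runnable reconstruction in the spirit of the official Coordinator example. The thread body my_loop and its stop condition are illustrative stand-ins:

import threading
import time
import tensorflow as tf

# Thread body: loop until the Coordinator receives a stop request.
def my_loop(coord):
    step = 0
    while not coord.should_stop():
        time.sleep(0.1)           # stand-in for real work
        step += 1
        if step >= 20:            # some condition becomes true ...
            coord.request_stop()  # ... so ask the Coordinator to stop every thread

# Main thread: create a Coordinator, then some threads that use it.
coord = tf.train.Coordinator()
threads = [threading.Thread(target=my_loop, args=(coord,)) for _ in range(10)]
for t in threads:
    t.start()
coord.join(threads)  # wait until all child threads have terminated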

The code above is only a simple example, and you don't have to copy it verbatim in your own design. In practice the Coordinator is attached to a queue and, after receiving a stop request, is responsible for shutting down the threads that the queue started.

  • QueueRunner: queue manager
    • Create and start several threads for a single QueueRunner
qr = tf.train.QueueRunner(queue, [enqueue_op] * 4)
threads = qr.create_threads(sess, coord=coord, start=True)
      • When you later call the create_threads() method, the QueueRunner will create one thread for each op in enqueue_ops. Each thread will run its enqueue op in parallel with the other threads.
      • If a coordinator is given, this method starts an additional thread to close the queue when the coordinator requests a stop or an exception occurs (a complete runnable sketch follows the code below)
    • Create and start several threads for multiple QueueRunners
qr1 = tf.train.QueueRunner(queue, [enqueue_op] * 4)
qr2 = tf.train.QueueRunner(queue, [enqueue_op] * 4)
tf.train.add_queue_runner(qr1)  # register both runners in the default graph's collection
tf.train.add_queue_runner(qr2)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess, coord=coord)  # starts every registered runner
# ... run training ...
coord.request_stop()
coord.join(threads)
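As referenced above, here is a self-contained sketch of driving a single QueueRunner by hand with create_threads(); the queue contents, capacity, and thread count are illustrative choices, not part of the original post:

import tensorflow as tf

queue = tf.FIFOQueue(capacity=32, dtypes=tf.float32)
enqueue_op = queue.enqueue_many((tf.random_normal([16]),))  # each run pushes 16 scalars
qr = tf.train.QueueRunner(queue, [enqueue_op] * 4)          # 4 parallel enqueue threads
coord = tf.train.Coordinator()

with tf.Session() as sess:
    # One thread per op in enqueue_ops, plus a queue-closing thread for the coordinator.
    threads = qr.create_threads(sess, coord=coord, start=True)
    for _ in range(10):
        print(sess.run(queue.dequeue()))
    coord.request_stop()
    coord.join(threads)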
  • Exception handling
    • Threads started via queue runners do more than push examples into the queue: they also catch and handle the exceptions the queue raises, including OutOfRangeError, which reports that the queue has been closed
    • A training program that uses a Coordinator must likewise catch and report exceptions in its main loop
Below is an improved version of the training loop above:
try:
    for step in range(1000000):
        if coord.should_stop():
            break
        sess.run(train_op)
except Exception as e:
    # Report exceptions to the coordinator.
    coord.request_stop(e)

# Terminate as usual. It is innocuous to request stop twice.
coord.request_stop()
coord.join(threads)

Complete code example: reading data with queues and multiple threads

# coding=utf-8
import time
import tensorflow as tf

# We simulate some raw input data 
# (think about it as fetching some data from the file system)
# let's say: batches of 128 samples, each containing 1024 data points
x_input_data = tf.random_normal([128, 1024], mean=0, stddev=1)

# We build our input queue, then a small model: a basic two-layer neural net with ReLU
with tf.variable_scope("queue"):
    q = tf.FIFOQueue(capacity=5, dtypes=tf.float32) # enqueue 5 batches
    # We use the "enqueue" operation so 1 element of the queue is the full batch
    enqueue_op = q.enqueue(x_input_data)
    numberOfThreads = 1
    qr = tf.train.QueueRunner(q, [enqueue_op] * numberOfThreads)
    tf.train.add_queue_runner(qr)
    input = q.dequeue() # It replaces our input placeholder
    # We can also compute y_true right into the graph now
    y_true = tf.cast(tf.reduce_sum(input, axis=1, keep_dims=True) > 0, tf.int32)

with tf.variable_scope('FullyConnected'):
    w = tf.get_variable('w', shape=[1024, 1024], initializer=tf.random_normal_initializer(stddev=1e-1))
    b = tf.get_variable('b', shape=[1024], initializer=tf.constant_initializer(0.1))
    z = tf.matmul(input, w) + b
    y = tf.nn.relu(z)

    w2 = tf.get_variable('w2', shape=[1024, 1], initializer=tf.random_normal_initializer(stddev=1e-1))
    b2 = tf.get_variable('b2', shape=[1], initializer=tf.constant_initializer(0.1))
    z = tf.matmul(y, w2) + b2

with tf.variable_scope('Loss'):
    losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.cast(y_true, tf.float32), logits=z)
    loss_op = tf.reduce_mean(losses)

with tf.variable_scope('Accuracy'):
    y_pred = tf.cast(z > 0, tf.int32)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(y_pred, y_true), tf.float32))
    accuracy = tf.Print(accuracy, data=[accuracy], message="accuracy:")

# We add the training op ...
adam = tf.train.AdamOptimizer(1e-2)
train_op = adam.minimize(loss_op, name="train_op")

startTime = time.time()
with tf.Session() as sess:
    # ... init our variables, ...
    sess.run(tf.global_variables_initializer())

    # ... add the coordinator, ...
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    # ... check the accuracy before training (without feed_dict!), ...
    sess.run(accuracy)

    # ... train ...
    # The try block is not strictly necessary in this toy example. In practice, however,
    # a QueueRunner-backed file queue is usually built with a maximum number of training
    # epochs; once the number of dequeue ops exceeds num_epochs * queue size, the read op
    # hits EOF and the final dequeue raises tf.errors.OutOfRangeError.
    try:
        for i in range(5000):
            #  ... without sampling from Python and without a feed_dict !
            _, loss = sess.run([train_op, loss_op])

            # We regularly check the loss
            if i % 500 == 0:
                print('iter:%d - loss:%f' % (i, loss))
    except tf.errors.OutOfRangeError:
        print('Done training -- epoch limit reached')
    finally:
        # When done, ask the threads to stop.
        coord.request_stop()
 
    coord.join(threads)
    
    # Finally, we check our final accuracy
    sess.run(accuracy)

print("Time taken: %f" % (time.time() - startTime))

Advantages

  • Data reading runs asynchronously with training, which speeds up training
  • Higher GPU utilization

References

  • The official tutorial on threading and queues
  • TensorFlow tutorial series (5): how to use queues and multithreading to optimize the input pipeline
  • More details on QueueRunner
    • the tf.train.queue_runner module
    • the tf.train.QueueRunner class (equivalently, tf.train.queue_runner.QueueRunner) and its create_threads method
    • the module-level op tf.train.start_queue_runners