By default, TensorFlow grabs all the memory on every visible GPU as soon as training starts. If someone then wants to run a program on the other two GPUs, it fails for lack of free memory. We therefore need to either cap the memory a process may take or pin it to a single card.
1. Four approaches can be summarized from the "Using GPUs" section of the official TensorFlow tutorials:
- The first is allow_growth, which allocates GPU memory lazily at run time. With allow_growth set to True, the process starts out with very little memory and grows the allocation as the workload demands it:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config, ...)
- The second is per_process_gpu_memory_fraction, which fixes the fraction of memory the process may allocate on each visible GPU. Set it through tf.GPUOptions (or the gpu_options field of a ConfigProto) when constructing the tf.Session():
# Tell TF it may use 40% of the memory on each GPU
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
This puts an upper bound on the memory the process takes per GPU, but the same bound applies to every GPU; different GPUs cannot be given different limits.
- The third is CUDA_VISIBLE_DEVICES, set before the training program starts. It can be set inside the script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1'
or in the shell (for example in ~/.bashrc):
export CUDA_VISIBLE_DEVICES=NUM
where NUM is the index of the card you want (0, 1, 2, ...); run nvidia-smi first to see which cards are currently free. The drawback is that this restricts which GPUs the process can see at all: other programs launched from the same environment cannot pick a different GPU, and your program cannot use GPUs beyond the ones listed. (A sketch that combines device selection with the memory options above follows this list.)
- The fourth is to collect the idle GPUs and hand out a requested number of them on demand:
import GPUtil
import os
# Collect idle GPUs (load and memory usage both below 1%)
g_c = 3  # how many GPUs we would like to use
deviceIDs = GPUtil.getAvailable(order='first', limit=8, maxLoad=0.01, maxMemory=0.01, includeNan=False, excludeID=[], excludeUUID=[])
deviceIDs = [6] if not deviceIDs else deviceIDs  # fall back to GPU 6 if none are reported idle
g_c = len(deviceIDs) if len(deviceIDs) < g_c else g_c  # number of idle GPUs actually granted
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(gp) for gp in deviceIDs[:g_c]])
print("free GPUs", deviceIDs)
2. Multi-GPU training on MNIST
Multi-GPU parallelism falls into two broad classes, model parallelism and data parallelism. We use data parallelism here, which is what most workloads reach for. Data parallelism itself comes in a synchronous and an asynchronous flavour; since the cards in one machine are usually identical, we pick the synchronous form: the data is split across the cards, and once every GPU has computed its gradients they are averaged and the variables are updated once.
The first thing to change is the data feeding. With several cards, each card needs its own slice of the data, so when fetching a batch we enlarge it to batch_x,batch_y=mnist.train.next_batch(batch_size*num_gpus), pulling enough examples in one call for every card to receive batch_size of them. We then split what was fetched: with i as the GPU index, each GPU gets a contiguous batch_size-sized slice:
_x=X[i*batch_size:(i+1)*batch_size]
_y=Y[i*batch_size:(i+1)*batch_size]
Because the same graph is replicated on several GPUs, it is best to wrap each replica in a name_scope so that the op names do not collide, i.e. in the following form:
for i in range(2):
with tf.device("/gpu:%d"%i):
with tf.name_scope("tower_%d"%i):
_x=X[i*batch_size:(i+1)*batch_size]
_y=Y[i*batch_size:(i+1)*batch_size]
logits=conv_net(_x,dropout,reuse_vars,True)
We need a list to collect the gradients from all GPUs, plus a flag for variable reuse, so before the loop we define the following two values:
tower_grads=[]
reuse_vars=False
With the preparation done, we can compute the gradients on each GPU:
opt = tf.train.AdamOptimizer(learning_rate)
loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=_y,logits=logits))
grads=opt.compute_gradients(loss)
reuse_vars=True
tower_grads.append(grads)
Now tower_grads holds the gradients of every variable on every GPU, and the next step is to average them. Of all the functions I have seen, this is the one whose code almost never changes between implementations:
def average_gradients(tower_grads):
average_grads=[]
for grad_and_vars in zip(*tower_grads):
grads=[]
for g,_ in grad_and_vars:
expend_g=tf.expand_dims(g,0)
grads.append(expend_g)
grad=tf.concat(grads,0)
grad=tf.reduce_mean(grad,0)
v=grad_and_vars[0][1]
grad_and_var=(grad,v)
average_grads.append(grad_and_var)
return average_grads
tower_grads is stored in the form (gradients from GPU 0, gradients from GPU 1, ..., gradients from GPU N-1). The detail to note is zip(*): it transposes that list into the form ((grad0_gpu0, var0_gpu0), ..., (grad0_gpuN, var0_gpuN)), i.e. it reads the list column-wise, so each iteration yields the (gradient, variable) pairs of one variable across all GPUs.
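To see the transposition concretely, here is a tiny stand-alone illustration with made-up gradient and variable names (plain Python, nothing TensorFlow-specific):
# Two towers, each holding (gradient, variable) pairs for two variables w and b
tower_grads = [
    [("grad_w_gpu0", "w"), ("grad_b_gpu0", "b")],  # GPU 0
    [("grad_w_gpu1", "w"), ("grad_b_gpu1", "b")],  # GPU 1
]
for grad_and_vars in zip(*tower_grads):
    print(grad_and_vars)
# prints (('grad_w_gpu0', 'w'), ('grad_w_gpu1', 'w')) and then (('grad_b_gpu0', 'b'), ('grad_b_gpu1', 'b')),
# i.e. one variable at a time, with that variable's gradient from every GPU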
Finally, apply the averaged gradients to update the variables:
grads=average_gradients(tower_grads)
train_op=opt.apply_gradients(grads)
The walkthrough above is a little fragmented, so here is the complete program for easier testing:
import time
import numpy as np
import tensorflow as tf
from tensorflow.contrib import slim
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("mnist/", one_hot=True)
def get_available_gpus():
"""
code from http://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow
"""
from tensorflow.python.client import device_lib as _device_lib
local_device_protos = _device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
num_gpus = len(get_available_gpus())
print("Available GPU Number :"+str(num_gpus))
num_steps = 1000
learning_rate = 0.001
batch_size = 1000
display_step = 10
num_input = 784
num_classes = 10
def conv_net_with_layers(x,is_training,dropout = 0.75):
with tf.variable_scope("ConvNet", reuse=tf.AUTO_REUSE):
x = tf.reshape(x, [-1, 28, 28, 1])
x = tf.layers.conv2d(x, 12, 5, activation=tf.nn.relu)
x = tf.layers.max_pooling2d(x, 2, 2)
x = tf.layers.conv2d(x, 24, 3, activation=tf.nn.relu)
x = tf.layers.max_pooling2d(x, 2, 2)
x = tf.layers.flatten(x)
x = tf.layers.dense(x, 100)
x = tf.layers.dropout(x, rate=dropout, training=is_training)
out = tf.layers.dense(x, 10)
out = tf.nn.softmax(out) if not is_training else out
return out
def conv_net(x,is_training):
# "updates_collections": None is very import ,without will only get 0.10
batch_norm_params = {"is_training": is_training, "decay": 0.9, "updates_collections": None}
#,'variables_collections': [ tf.GraphKeys.TRAINABLE_VARIABLES ]
with slim.arg_scope([slim.conv2d, slim.fully_connected],
activation_fn=tf.nn.relu,
weights_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.01),
weights_regularizer=slim.l2_regularizer(0.0005),
normalizer_fn=slim.batch_norm, normalizer_params=batch_norm_params):
with tf.variable_scope("ConvNet",reuse=tf.AUTO_REUSE):
x = tf.reshape(x, [-1, 28, 28, 1])
net = slim.conv2d(x, 6, [5,5], scope="conv_1")
net = slim.max_pool2d(net, [2, 2],scope="pool_1")
net = slim.conv2d(net, 12, [5,5], scope="conv_2")
net = slim.max_pool2d(net, [2, 2], scope="pool_2")
net = slim.flatten(net, scope="flatten")
net = slim.fully_connected(net, 100, scope="fc")
net = slim.dropout(net,is_training=is_training)
net = slim.fully_connected(net, num_classes, scope="prob", activation_fn=None,normalizer_fn=None)
return net
def average_gradients(tower_grads):
average_grads = []
for grad_and_vars in zip(*tower_grads):
grads = []
for g, _ in grad_and_vars:
expend_g = tf.expand_dims(g, 0)
grads.append(expend_g)
grad = tf.concat(grads, 0)
grad = tf.reduce_mean(grad, 0)
v = grad_and_vars[0][1]
grad_and_var = (grad, v)
average_grads.append(grad_and_var)
return average_grads
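# Ops of these types hold variables; assign_to_device below pins them to the parameter-server
# device (the CPU here) so that all towers share a single copy of the weights, while every
# other op stays on the tower's own GPU.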
PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']
def assign_to_device(device, ps_device='/cpu:0'):
def _assign(op):
node_def = op if isinstance(op, tf.NodeDef) else op.node_def
if node_def.op in PS_OPS:
return "/" + ps_device
else:
return device
return _assign
def train():
with tf.device("/cpu:0"):
global_step=tf.train.get_or_create_global_step()
tower_grads = []
X = tf.placeholder(tf.float32, [None, num_input])
Y = tf.placeholder(tf.float32, [None, num_classes])
opt = tf.train.AdamOptimizer(learning_rate)
with tf.variable_scope(tf.get_variable_scope()):
for i in range(num_gpus):
with tf.device(assign_to_device('/gpu:{}'.format(i), ps_device='/cpu:0')):
_x = X[i * batch_size:(i + 1) * batch_size]
_y = Y[i * batch_size:(i + 1) * batch_size]
logits = conv_net(_x, True)
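# The first tower creates the variables; mark the scope so later towers (and the eval graph below) reuse them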
tf.get_variable_scope().reuse_variables()
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=_y, logits=logits))
grads = opt.compute_gradients(loss)
tower_grads.append(grads)
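# Build the evaluation graph only once, on the first tower, with is_training=False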
if i == 0:
logits_test = conv_net(_x, False)
correct_prediction = tf.equal(tf.argmax(logits_test, 1), tf.argmax(_y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
grads = average_gradients(tower_grads)
train_op = opt.apply_gradients(grads)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for step in range(1, num_steps + 1):
batch_x, batch_y = mnist.train.next_batch(batch_size * num_gpus)
ts = time.time()
sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
te = time.time() - ts
if step % 10 == 0 or step == 1:
loss_value, acc = sess.run([loss, accuracy], feed_dict={X: batch_x, Y: batch_y})
print("Step:" + str(step) + ":" + str(loss_value) + " " + str(acc)+", %i Examples/sec" % int(len(batch_x)/te))
print("Done")
print("Testing Accuracy:",
np.mean([sess.run(accuracy, feed_dict={X: mnist.test.images[i:i + batch_size],
Y: mnist.test.labels[i:i + batch_size]}) for i in
range(0, len(mnist.test.images), batch_size)]))
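# Single-device baseline, kept for comparison with the multi-GPU train() above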
def train_single():
X = tf.placeholder(tf.float32, [None, num_input])
Y = tf.placeholder(tf.float32, [None, num_classes])
logits=conv_net(X,True)
loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y,logits=logits))
opt=tf.train.AdamOptimizer(learning_rate)
train_op=opt.minimize(loss)
logits_test=conv_net(X,False)
correct_prediction = tf.equal(tf.argmax(logits_test, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for step in range(1,num_steps+1):
batch_x, batch_y = mnist.train.next_batch(batch_size)
sess.run(train_op,feed_dict={X:batch_x,Y:batch_y})
if step%display_step==0 or step==1:
loss_value,acc=sess.run([loss,accuracy],feed_dict={X:batch_x,Y:batch_y})
print("Step:" + str(step) + ":" + str(loss_value) + " " + str(acc))
print("Done")
print("Testing Accuracy:",np.mean([sess.run(accuracy, feed_dict={X: mnist.test.images[i:i + batch_size],
Y: mnist.test.labels[i:i + batch_size]}) for i in
range(0, len(mnist.test.images), batch_size)]))
if __name__ == "__main__":
#train_single()
train()
If the machine has several cards, you can also wrap the launch in a small script that picks which cards to use:
export CUDA_VISIBLE_DEVICES=1,2
python train.py