I have recently been using TensorFlow's MirroredStrategy for single-machine, multi-GPU training of deep networks, and in this post I share a pitfall I hit along the way and how to work around it. In TensorFlow, the distribution strategy MirroredStrategy and the exponential moving average ExponentialMovingAverage (EMA) are somewhat at odds by design. Concretely: MirroredStrategy creates a reusable, mirrored copy of each variable per device, but when EMA is used at the same time, the names it derives for its shadow variables end up inconsistent across replicas, which leads to the following error:
RuntimeError: Tried to create variable dense/kernel/replica_1/ExponentialMovingAverage/ with mismatching name on device 1
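To see where the clashing names come from: in TF 1.x graph mode, ema.apply() creates a shadow variable for every variable it is handed, and derives the shadow's name from the original variable's full name. A minimal single-device sketch (purely illustrative, assuming TF 1.x; not part of the reproduction below):

import tensorflow as tf

# TF 1.x graph mode: one variable plus the shadow that EMA creates for it.
w = tf.compat.v1.get_variable('kernel', shape=[1])
ema = tf.train.ExponentialMovingAverage(0.999)
ema.apply([w])
print([v.name for v in tf.compat.v1.global_variables()])
# ['kernel:0', 'kernel/ExponentialMovingAverage:0']

Under MirroredStrategy, the copy of a variable on each extra device carries a per-replica suffix (e.g. dense/kernel/replica_1), so the shadow names derived from it differ from device to device; that per-device name mismatch is what the RuntimeError above is complaining about.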
A minimal example that reproduces the error (this is not my own code; the simplified snippet comes from [1]):
import tensorflow as tf

def input_fn():
    features = tf.data.Dataset.from_tensors([1., 2., 3.])
    labels = tf.data.Dataset.from_tensors(1.)
    dataset = tf.data.Dataset.zip((features, labels)).repeat(10).batch(1)
    return dataset

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features, 1, activation=tf.nn.relu)
    logits = tf.reshape(logits, (-1,))
    loss = tf.losses.mean_squared_error(logits, labels)
    ema = tf.train.ExponentialMovingAverage(0.999)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0001)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        # apply moving averages
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            ema_update_op = ema.apply(tf.trainable_variables())
        with tf.control_dependencies([train_op]):
            train_op = tf.group(ema_update_op)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    raise NotImplementedError

def train(model_dir):
    distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        config=tf.estimator.RunConfig(
            train_distribute=distribution,
            model_dir=model_dir,
            log_step_count_steps=1,
        ),
    )
    estimator.train(input_fn=input_fn)

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    train('training/test1')
In short, the error comes from using tf.estimator + tf.contrib.distribute.MirroredStrategy + tf.train.ExponentialMovingAverage together.

One workaround (the one that works in my code): move the ema definition outside of model_fn, as follows:
import tensorflow as tf

def input_fn():
    features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
    labels = tf.data.Dataset.from_tensors(1.).repeat(100)
    return tf.data.Dataset.zip((features, labels))

# Create the EMA object once, at module level, instead of inside model_fn
ema = tf.train.ExponentialMovingAverage(decay=0.9999)

def model_fn(features, labels, mode):
    regularizer = tf.contrib.layers.l2_regularizer(0.001)
    layer = tf.compat.v1.layers.Dense(1, kernel_regularizer=regularizer)
    logits = layer(features)
    loss = tf.compat.v1.losses.mean_squared_error(labels, tf.reshape(logits, []))
    global_step = tf.compat.v1.train.get_or_create_global_step()
    # NOTE: here we set the global step for the EMA's decay schedule
    # (this pokes a private attribute)
    ema._num_updates = global_step
    train_op = tf.compat.v1.train.GradientDescentOptimizer(0.2).minimize(loss, global_step=global_step)
    ema_vars = tf.compat.v1.trainable_variables()
    with tf.control_dependencies([train_op]):
        train_op = ema.apply(ema_vars)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(log_step_count_steps=10, train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
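The workaround above only wires the EMA update into training. If you also want evaluation to actually read the averaged weights back from the checkpoint, the usual TF 1.x pattern is to build a Saver from ema.variables_to_restore() and hand it to the EstimatorSpec through a Scaffold. A sketch of how the model_fn above could be extended (my own addition, not taken from [1], and I have not verified it under MirroredStrategy):

# Reuses the module-level `ema` and the `import tensorflow as tf` from the snippet above.
def model_fn(features, labels, mode):
    layer = tf.compat.v1.layers.Dense(1)
    logits = layer(features)
    loss = tf.compat.v1.losses.mean_squared_error(labels, tf.reshape(logits, []))
    if mode == tf.estimator.ModeKeys.TRAIN:
        global_step = tf.compat.v1.train.get_or_create_global_step()
        opt = tf.compat.v1.train.GradientDescentOptimizer(0.2)
        train_op = opt.minimize(loss, global_step=global_step)
        with tf.control_dependencies([train_op]):
            train_op = ema.apply(tf.compat.v1.trainable_variables())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    # EVAL: restore each variable from its .../ExponentialMovingAverage shadow
    # in the checkpoint, so evaluation sees the averaged weights, not the raw ones.
    saver = tf.compat.v1.train.Saver(ema.variables_to_restore())
    scaffold = tf.compat.v1.train.Scaffold(saver=saver)
    return tf.estimator.EstimatorSpec(mode, loss=loss, scaffold=scaffold)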
Other workarounds can be found in reference [1]. One last remark: when you run into problems like this, get into the habit of searching international forums such as Stack Overflow and GitHub, rather than relying only on Baidu and CSDN.
References
[1] https://github.com/tensorflow/tensorflow/issues/27392