I have recently been using TensorFlow's MirroredStrategy for single-machine, multi-GPU training of deep networks, and in this post I share a pitfall I hit along the way and how to work around it. In TensorFlow, the distribution strategy MirroredStrategy and the exponential moving average ExponentialMovingAverage (EMA) are somewhat at odds by design. Concretely: MirroredStrategy creates a reusable, mirrored copy of each variable per device, but when EMA is used at the same time, the names it derives for its shadow variables end up inconsistent across replicas, which leads to the following error:
RuntimeError: Tried to create variable dense/kernel/replica_1/ExponentialMovingAverage/ with mismatching name on device 1
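To see where the clashing names come from: in TF 1.x graph mode, ema.apply() creates a shadow variable for every variable it is handed, and derives the shadow's name from the original variable's full name. A minimal single-device sketch (purely illustrative, assuming TF 1.x; not part of the reproduction below):

import tensorflow as tf

# TF 1.x graph mode: one variable plus the shadow that EMA creates for it.
w = tf.compat.v1.get_variable('kernel', shape=[1])
ema = tf.train.ExponentialMovingAverage(0.999)
ema.apply([w])
print([v.name for v in tf.compat.v1.global_variables()])
# ['kernel:0', 'kernel/ExponentialMovingAverage:0']

Under MirroredStrategy, the copy of a variable on each extra device carries a per-replica suffix (e.g. dense/kernel/replica_1), so the shadow names derived from it differ from device to device; that per-device name mismatch is what the RuntimeError above is complaining about.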
A minimal example that reproduces the error (this is not my own code; the simplified snippet comes from [1]):
import tensorflow as tf

def input_fn():
    features = tf.data.Dataset.from_tensors([1., 2., 3.])
    labels = tf.data.Dataset.from_tensors(1.)
    dataset = tf.data.Dataset.zip((features, labels)).repeat(10).batch(1)
    return dataset

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features, 1, activation=tf.nn.relu)
    logits = tf.reshape(logits, (-1,))
    loss = tf.losses.mean_squared_error(logits, labels)
    ema = tf.train.ExponentialMovingAverage(0.999)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0001)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        # apply moving averages
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            ema_update_op = ema.apply(tf.trainable_variables())
        with tf.control_dependencies([train_op]):
            train_op = tf.group(ema_update_op)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    raise NotImplementedError

def train(model_dir):
    distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        config=tf.estimator.RunConfig(
            train_distribute=distribution,
            model_dir=model_dir,
            log_step_count_steps=1,
        ),
    )
    estimator.train(input_fn=input_fn)

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    train('training/test1')
In short, the error comes from using tf.estimator + tf.contrib.distribute.MirroredStrategy + tf.train.ExponentialMovingAverage together.

One workaround (the one that works in my code): move the ema definition outside of model_fn, as follows:
import tensorflow as tf

def input_fn():
    features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
    labels = tf.data.Dataset.from_tensors(1.).repeat(100)
    return tf.data.Dataset.zip((features, labels))

# Create the EMA object once, at module level, instead of inside model_fn
ema = tf.train.ExponentialMovingAverage(decay=0.9999)

def model_fn(features, labels, mode):
    regularizer = tf.contrib.layers.l2_regularizer(0.001)
    layer = tf.compat.v1.layers.Dense(1, kernel_regularizer=regularizer)
    logits = layer(features)
    loss = tf.compat.v1.losses.mean_squared_error(labels, tf.reshape(logits, []))
    global_step = tf.compat.v1.train.get_or_create_global_step()
    # NOTE: here we set the global step for the EMA's decay schedule
    # (this pokes a private attribute)
    ema._num_updates = global_step
    train_op = tf.compat.v1.train.GradientDescentOptimizer(0.2).minimize(loss, global_step=global_step)
    ema_vars = tf.compat.v1.trainable_variables()
    with tf.control_dependencies([train_op]):
        train_op = ema.apply(ema_vars)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(log_step_count_steps=10, train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
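The workaround above only wires the EMA update into training. If you also want evaluation to actually read the averaged weights back from the checkpoint, the usual TF 1.x pattern is to build a Saver from ema.variables_to_restore() and hand it to the EstimatorSpec through a Scaffold. A sketch of how the model_fn above could be extended (my own addition, not taken from [1], and I have not verified it under MirroredStrategy):

# Reuses the module-level `ema` and the `import tensorflow as tf` from the snippet above.
def model_fn(features, labels, mode):
    layer = tf.compat.v1.layers.Dense(1)
    logits = layer(features)
    loss = tf.compat.v1.losses.mean_squared_error(labels, tf.reshape(logits, []))
    if mode == tf.estimator.ModeKeys.TRAIN:
        global_step = tf.compat.v1.train.get_or_create_global_step()
        opt = tf.compat.v1.train.GradientDescentOptimizer(0.2)
        train_op = opt.minimize(loss, global_step=global_step)
        with tf.control_dependencies([train_op]):
            train_op = ema.apply(tf.compat.v1.trainable_variables())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    # EVAL: restore each variable from its .../ExponentialMovingAverage shadow
    # in the checkpoint, so evaluation sees the averaged weights, not the raw ones.
    saver = tf.compat.v1.train.Saver(ema.variables_to_restore())
    scaffold = tf.compat.v1.train.Scaffold(saver=saver)
    return tf.estimator.EstimatorSpec(mode, loss=loss, scaffold=scaffold)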
Other workarounds can be found in reference [1]. One last remark: when you run into problems like this, get into the habit of searching international forums such as Stack Overflow and GitHub, rather than relying only on Baidu and CSDN.
References
[1] https://github.com/tensorflow/tensorflow/issues/27392