Fixing the ExponentialMovingAverage error under TensorFlow's MirroredStrategy

I recently used TensorFlow's MirroredStrategy for single-machine multi-GPU training of a deep network, and this post shares a pitfall I hit and how to work around it. In TensorFlow, the MirroredStrategy distribution strategy and the exponential moving average utility ExponentialMovingAverage conflict by design: MirroredStrategy creates reusable variables in a separate scope per replica, but when EMA is applied inside the model function it generates mismatched shadow-variable names across replicas, which triggers the following error:

RuntimeError: Tried to create variable dense/kernel/replica_1/ExponentialMovingAverage/ with mismatching name on device 1
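As background, ExponentialMovingAverage maintains a shadow copy of each tracked variable and nudges it toward the current value on every update, following the rule shadow = decay * shadow + (1 - decay) * value. A minimal pure-Python sketch of that update rule (the values below are illustrative, not from the example above):

```python
def ema_update(shadow, value, decay):
    # One EMA step: the shadow moves a fraction (1 - decay) toward the value.
    return decay * shadow + (1 - decay) * value

# Track a stream of constant values with decay 0.9.
shadow = 0.0
for v in [1.0, 1.0, 1.0]:
    shadow = ema_update(shadow, v, 0.9)
print(round(shadow, 3))  # 0.271
```

With a decay close to 1 (the post uses 0.999), the shadow variables change slowly and smooth out noise in the raw training weights.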

The code below reproduces the error (this is not my own project code; the simplified example is taken from [1]):

import tensorflow as tf


def input_fn():
    features = tf.data.Dataset.from_tensors([1., 2., 3.])
    labels = tf.data.Dataset.from_tensors(1.)
    dataset = tf.data.Dataset.zip((features, labels)).repeat(10).batch(1)
    return dataset


def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features, 1, activation=tf.nn.relu)
    logits = tf.reshape(logits, (-1,))
    loss = tf.losses.mean_squared_error(logits, labels)

    ema = tf.train.ExponentialMovingAverage(0.999)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0001)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

        # apply moving averages
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            ema_update_op = ema.apply(tf.trainable_variables())

        with tf.control_dependencies([train_op]):
            train_op = tf.group(ema_update_op)

        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    raise NotImplementedError


def train(model_dir):
    distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)

    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        config=tf.estimator.RunConfig(
            train_distribute=distribution,
            model_dir=model_dir,
            log_step_count_steps=1,
        ),
    )

    estimator.train(input_fn=input_fn)


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    train('training/test1')

In short, the error comes from using tf.estimator, tf.contrib.distribute.MirroredStrategy, and tf.train.ExponentialMovingAverage together.

One workaround (which worked in my code) is to move the ema definition outside model_fn, so the EMA object is created once instead of once per replica:

def input_fn():
  features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
  labels = tf.data.Dataset.from_tensors(1.).repeat(100)
  return tf.data.Dataset.zip((features, labels))

ema = tf.train.ExponentialMovingAverage(decay=0.9999)

def model_fn(features, labels, mode):
  regularizer = tf.contrib.layers.l2_regularizer(0.001)
  layer = tf.compat.v1.layers.Dense(1, kernel_regularizer=regularizer)
  logits = layer(features)

  loss = tf.compat.v1.losses.mean_squared_error(labels, tf.reshape(logits, []))

  global_step = tf.compat.v1.train.get_or_create_global_step()

  # NOTE: reuse the global step as the EMA's update counter. This assigns the
  # private _num_updates attribute, so it may break in other TF versions.
  ema._num_updates = global_step

  train_op = tf.compat.v1.train.GradientDescentOptimizer(0.2).minimize(loss, global_step=global_step)

  ema_vars = tf.compat.v1.trainable_variables()

  with tf.control_dependencies([train_op]):
    train_op = ema.apply(ema_vars)

  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(log_step_count_steps=10, train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
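Why does setting ema._num_updates to the global step matter? When num_updates is supplied, tf.train.ExponentialMovingAverage uses the smaller of the configured decay and (1 + num_updates) / (10 + num_updates), so early in training the average warms up quickly instead of being dominated by the initial weights. A pure-Python sketch of that schedule:

```python
def effective_decay(decay, num_updates):
    # TF caps the decay at (1 + num_updates) / (10 + num_updates) when
    # num_updates is given, so early steps weight new values heavily.
    return min(decay, (1.0 + num_updates) / (10.0 + num_updates))

# The effective decay ramps from 0.1 toward the configured 0.9999.
for step in [0, 10, 100, 10000]:
    print(step, effective_decay(0.9999, step))
```

This is why the fix ties _num_updates to the global step: without it, the EMA would apply the full 0.9999 decay from step 0.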

Other workarounds are listed in reference [1]. One last remark: when you hit a coding problem, get in the habit of searching international forums such as Stack Overflow and GitHub, rather than relying only on Baidu and CSDN.

References
[1] https://github.com/tensorflow/tensorflow/issues/27392
