Distributed training with TensorFlow 2.0

两个幽灵

Original link: http://www.tensorflow.com

The latest official TensorFlow translation is at https://www.tensorflow.org/guide/distributed_training?hl=zh-cn; the translation in this article is out of date.

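Overview

tf.distribute.Strategy is a TensorFlow API for distributing training across multiple GPUs, multiple machines, or TPUs. With this API, you can distribute existing models and training code with minimal code changes.

tf.distribute.Strategy was designed with these goals in mind:

Easy to use, and able to support multiple user segments, including researchers, machine learning engineers, and others.
Provide good performance out of the box.
Make it easy to switch between strategies.

tf.distribute.Strategy can be used with a high-level API such as Keras, and it can also be used to distribute custom training loops (and, more generally, any computation written with TensorFlow).

In TensorFlow 2.0, you can execute programs eagerly, or in a graph using tf.function. tf.distribute.Strategy is intended to support both modes of execution. Although this guide only discusses training, the API can also be used to distribute evaluation and prediction on different platforms.

You can use tf.distribute.Strategy with very few changes to your code, because the underlying components of TensorFlow are strategy-aware. These components include variables, layers, optimizers, metrics, summaries, and checkpoints.

In this guide, we explain the different types of strategies and how to use them in different situations.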

# Import TensorFlow
!pip install -q tf-nightly
import tensorflow as tf

ERROR: tensorflow 2.1.0 has requirement gast==0.2.2, but you'll have gast 0.3.3 which is incompatible.

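Types of strategies

tf.distribute.Strategy is intended to cover a number of use cases along different axes. Some of these combinations are currently supported, and others will be added in the future. The axes are:

Synchronous vs. asynchronous training: these are two common ways of distributing training with data parallelism. In synchronous training, all workers train over different slices of the input data in sync and aggregate gradients at each step. In asynchronous training, all workers train over the input data independently and update variables asynchronously. Typically, synchronous training is implemented with all-reduce and asynchronous training with a parameter server architecture.
Hardware platform: you may want to scale your training onto multiple GPUs on one machine, onto multiple machines in a network (each with 0 or more GPUs), or onto Cloud TPUs.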
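
The difference between the two training modes can be illustrated with a toy example. The sketch below is plain Python rather than TensorFlow: the model (y = w * x), the data, and the function names are assumptions made for illustration, and the asynchronous case is simulated by letting the workers update the shared weight one after another.

```python
# Toy data-parallel SGD on the 1-D model y = w * x with squared-error loss.
# Plain Python only; this illustrates the idea, not TensorFlow internals.

def grad_mse(w, examples):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)


def sync_step(w, shards, lr):
    """Synchronous step: every worker uses the same w, gradients are averaged."""
    grads = [grad_mse(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


def async_steps(w, shards, lr):
    """Asynchronous flavour: each worker updates the shared w on its own.

    Workers are simulated one after another, so later workers see the
    updates made by earlier ones instead of waiting for a joint update.
    """
    for shard in shards:
        w = w - lr * grad_mse(w, shard)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # two workers, equal-sized slices

w_full = 0.0 - 0.01 * grad_mse(0.0, data)  # single-machine full-batch step
w_sync = sync_step(0.0, shards, lr=0.01)
w_async = async_steps(0.0, shards, lr=0.01)
print(w_full, w_sync, w_async)
```

With equal-sized slices, the synchronous step matches the single-machine full-batch step, while the asynchronous updates follow a different trajectory.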

To support these use cases, six strategies are available. The table below shows which strategies are supported in which scenarios in TF 2.0.

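Training API	MirroredStrategy	TPUStrategy	MultiWorkerMirroredStrategy	CentralStorageStrategy	ParameterServerStrategy	OneDeviceStrategy
Keras API	Supported	Experimental support	Experimental support	Experimental support	Support planned post 2.0	Supported
Custom training loop	Experimental support	Experimental support	Support planned post 2.0	Support planned post 2.0	No support yet	Supported
Estimator API	Limited support	Not supported	Limited support	Limited support	Limited support	Limited support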

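Note: Estimator support is limited. Basic training and evaluation are experimental, and advanced features, such as scaffold, are not implemented. If a use case is not covered, we recommend using Keras or custom training loops.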

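MirroredStrategy

tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.

Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up, and makes them available on each device. It is a fused algorithm that is very efficient and can significantly reduce the overhead of synchronization. Many all-reduce algorithms and implementations are available, depending on the type of communication available between devices. By default, NVIDIA NCCL is used as the all-reduce implementation. You can choose from the other options we provide, or write your own.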
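
To make the all-reduce contract concrete (every device ends up with the element-wise sum of all devices' tensors), here is a minimal plain-Python sketch of ring all-reduce, one well-known all-reduce algorithm. It is an illustration only: the function name and the list-based "devices" are assumptions, and it is not claimed to be how NCCL or TensorFlow implement the operation.

```python
def ring_all_reduce(per_device):
    """Simulate ring all-reduce (sum) over equally long lists of floats.

    per_device[d] is the vector held by device d. Returns one list per
    device, each equal to the element-wise sum of all input vectors.
    """
    n = len(per_device)
    length = len(per_device[0])
    # Split the index range into n contiguous chunks, one per ring position.
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    bufs = [list(v) for v in per_device]  # work on copies

    # Phase 1 (scatter-reduce): after n - 1 steps, device d holds the fully
    # summed chunk (d + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so that sends happen "simultaneously".
        messages = []
        for src in range(n):
            lo, hi = bounds[(src - step) % n]
            messages.append((lo, bufs[src][lo:hi]))
        for src, (lo, payload) in enumerate(messages):
            dst = (src + 1) % n
            for offset, value in enumerate(payload):
                bufs[dst][lo + offset] += value

    # Phase 2 (all-gather): pass the finished chunks around the ring.
    for step in range(n - 1):
        messages = []
        for src in range(n):
            lo, hi = bounds[(src + 1 - step) % n]
            messages.append((lo, bufs[src][lo:hi]))
        for src, (lo, payload) in enumerate(messages):
            dst = (src + 1) % n
            bufs[dst][lo:lo + len(payload)] = payload

    return bufs


# Three simulated devices, each holding a gradient vector of length 4.
grads = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0],
         [100.0, 200.0, 300.0, 400.0]]
reduced = ring_all_reduce(grads)
print(reduced[0])
```

In each step every device sends only one chunk to its neighbour, which is why ring-style algorithms spread communication evenly across the devices.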

Here is the simplest way of creating MirroredStrategy:

mirrored_strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)

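This creates a MirroredStrategy instance that uses all the GPUs visible to TensorFlow and uses NCCL for cross-device communication.

If you want to use only some of the GPUs on your machine, you can do so like this: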

mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:1,/job:localhost/replica:0/task:0/device:GPU:0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

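If you want to override the cross-device communication, you can do so by passing an instance of tf.distribute.CrossDeviceOps as the cross_device_ops argument. Currently, three all-reduce implementations are provided: tf.distribute.HierarchicalCopyAllReduce, tf.distribute.ReductionToOneDevice, and the default, tf.distribute.NcclAllReduce.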

mirrored_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)

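CentralStorageStrategy

tf.distribute.experimental.CentralStorageStrategy also does synchronous training. Variables are not mirrored; instead they are placed on the CPU, and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations are placed on that GPU.

Create an instance of CentralStorageStrategy: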

central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()

INFO:tensorflow:ParameterServerStrategy (CentralStorageStrategy if you are using a single machine) with compute_devices = ['/job:localhost/replica:0/task:0/device:GPU:0'], variable_device = '/job:localhost/replica:0/task:0/device:GPU:0'

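This creates a CentralStorageStrategy instance that uses all visible GPUs and the CPU. Updates to variables on the replicas are aggregated before being applied to the variables.

Note: This strategy is experimental, as we are currently improving it and making it work for more scenarios. Expect the API to change in the future.

MultiWorkerMirroredStrategy

tf.distribute.experimental.MultiWorkerMirroredStrategy implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to MirroredStrategy, it creates copies of all variables in the model on each device (CPU/GPU) across all workers.

It uses CollectiveOps as the multi-worker all-reduce communication method to keep variables in sync. A collective op is a single op in the TensorFlow graph that can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology, and tensor sizes.

It also implements additional performance optimizations. For example, it includes a static optimization that converts multiple all-reductions on small tensors into fewer all-reductions on larger tensors. In addition, we are designing it to have a plugin architecture, so that in the future you will be able to plug in algorithms that are better tuned for your hardware. Collective ops also implement other collective operations, such as broadcast and all-gather.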
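
The idea of replacing many small all-reductions with fewer large ones can be sketched in plain Python. The helper names below are assumptions for illustration, and the naive sum stands in for whatever collective the runtime actually uses; the point is only the pack, reduce once, unpack pattern.

```python
def all_reduce_sum(per_device_buffers):
    """Naive all-reduce: element-wise sum, result copied to every device."""
    total = [sum(values) for values in zip(*per_device_buffers)]
    return [list(total) for _ in per_device_buffers]


def fused_all_reduce(per_device_tensors):
    """All-reduce many small tensors with a single collective call.

    per_device_tensors[d] is the list of small tensors (lists of floats)
    held by device d; every device holds tensors of the same sizes.
    """
    sizes = [len(t) for t in per_device_tensors[0]]
    # Pack: concatenate each device's tensors into one flat buffer.
    packed = [[v for t in tensors for v in t] for tensors in per_device_tensors]
    reduced = all_reduce_sum(packed)  # one call instead of len(sizes) calls
    # Unpack: cut each reduced buffer back into the original sizes.
    result = []
    for flat in reduced:
        tensors, start = [], 0
        for size in sizes:
            tensors.append(flat[start:start + size])
            start += size
        result.append(tensors)
    return result


# Two simulated devices, each holding a bias gradient and a weight gradient.
device0 = [[1.0], [2.0, 3.0]]
device1 = [[10.0], [20.0, 30.0]]
fused = fused_all_reduce([device0, device1])
print(fused[0])
```

Each collective call carries a fixed overhead, so batching small tensors into one buffer reduces the number of times that overhead is paid.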

Here is the simplest way of creating MultiWorkerMirroredStrategy:

multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/device:GPU:0',)
INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.AUTO

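MultiWorkerMirroredStrategy currently lets you choose between two implementations of collective ops. CollectiveCommunication.RING implements ring-based collectives using gRPC as the communication layer. CollectiveCommunication.NCCL uses NVIDIA's NCCL to implement collectives. CollectiveCommunication.AUTO defers the choice to the runtime. The best choice of collective implementation depends on the number and kind of GPUs, and on the network interconnect in the cluster. You can specify one like this: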

multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)

WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/device:GPU:0',)
INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.NCCL

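One of the key differences in getting multi-worker training going, compared to multi-GPU training, is the multi-worker setup. The TF_CONFIG environment variable is the standard way in TensorFlow to specify the cluster configuration to each worker that is part of the cluster. See the section on setting up TF_CONFIG for details.

Note: This strategy is experimental, as we are currently improving it and making it work for more scenarios. Expect the API to change in the future.

ParameterServerStrategy

tf.distribute.experimental.ParameterServerStrategy supports parameter server training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.

In terms of code, it looks similar to the other strategies: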

ps_strategy = tf.distribute.experimental.ParameterServerStrategy()

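For multi-worker training, TF_CONFIG needs to specify the configuration of the parameter servers and workers in your cluster. TF_CONFIG is described in more detail later in this article.

OneDeviceStrategy

tf.distribute.OneDeviceStrategy runs on a single device. This strategy places any variables created in its scope on the specified device. Input distributed through this strategy is prefetched to the specified device. Moreover, any functions called via strategy.run are also placed on the specified device.

You can use this strategy to test your code before switching to other strategies that distribute to multiple devices or machines.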

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

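So far we have discussed the available strategies and how to instantiate them. In the next few sections, we discuss the different ways in which you can use them to distribute your training.

Using tf.distribute.Strategy with Keras

We have integrated tf.distribute.Strategy into tf.keras, TensorFlow's high-level API for building and training models. By integrating into the tf.keras backend, we have made it seamless for you to distribute training written in the Keras training framework.

You need to make the following changes to your code:

Create an instance of the appropriate tf.distribute.Strategy.
Move the creation and compilation of the Keras model inside strategy.scope.

All types of Keras models are supported: sequential, functional, and subclassed.

Here is a snippet of code that does this for a very simple Keras model: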

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)

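In this example we use MirroredStrategy, so the model can run on a machine with multiple GPUs. strategy.scope() indicates which parts of the code to run in a distributed manner. Creating a model inside this scope creates mirrored variables instead of regular variables. Compiling under the scope lets TensorFlow know that the user intends to train this model with this strategy. Once this is set up, you can fit your model as you normally would. MirroredStrategy takes care of replicating the model's training on the available GPUs, aggregating gradients, and more.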

dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)

Epoch 1/2
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
10/10 [==============================] - 0s 2ms/step - loss: 1.4031
Epoch 2/2
10/10 [==============================] - 0s 1ms/step - loss: 0.6202
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
10/10 [==============================] - 0s 1ms/step - loss: 0.3851

0.3851303160190582

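Whether the input is a tf.data.Dataset or NumPy arrays, each batch of the given input is divided equally among the replicas. For instance, if you use MirroredStrategy with 2 GPUs, each batch of size 10 is divided between the 2 GPUs, with each device receiving 5 input examples in each step. Training then gets faster as you add more GPUs. Typically, you would want to increase your batch size as you add more accelerators so as to make effective use of the extra computing power. You will also need to re-tune your learning rate, depending on the model. You can use strategy.num_replicas_in_sync to get the number of replicas.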

# Compute global batch size using number of replicas.
BATCH_SIZE_PER_REPLICA = 5
global_batch_size = (BATCH_SIZE_PER_REPLICA *
                     mirrored_strategy.num_replicas_in_sync)
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100)
dataset = dataset.batch(global_batch_size)

LEARNING_RATES_BY_BATCH_SIZE = {5: 0.1, 10: 0.15}
learning_rate = LEARNING_RATES_BY_BATCH_SIZE[global_batch_size]
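
As a plain-Python illustration of the batch arithmetic, the sketch below splits one global batch into per-replica slices and applies a linear learning-rate scaling heuristic. Both helpers are hypothetical: the divisibility check is a simplification of this sketch, and linear scaling is only one common starting point, not a rule taken from this guide (the lookup table above, for example, does not scale linearly).

```python
def split_batch(batch, num_replicas):
    """Split one global batch into equal per-replica slices."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    per_replica = len(batch) // num_replicas
    return [batch[i * per_replica:(i + 1) * per_replica]
            for i in range(num_replicas)]


def linearly_scaled_lr(base_lr, base_batch_size, global_batch_size):
    """Heuristic: grow the learning rate in proportion to the batch size."""
    return base_lr * global_batch_size / base_batch_size


global_batch = list(range(10))          # a global batch of 10 examples
slices = split_batch(global_batch, 2)   # what each of 2 replicas would see
print(slices)
print(linearly_scaled_lr(base_lr=0.1, base_batch_size=5, global_batch_size=10))
```

Whatever rule is used, the resulting learning rate should still be validated on the actual model.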

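What's supported now?

In TF 2.0, MirroredStrategy, TPUStrategy, CentralStorageStrategy, and MultiWorkerMirroredStrategy are supported in Keras. All of them except MirroredStrategy are currently experimental and subject to change. Support for other strategies will be coming soon. The API and usage are exactly the same as above.

Using tf.distribute.Strategy with custom training loops

As you have seen, using tf.distribute.Strategy with the high-level APIs requires changing only a couple of lines of code. With a little more effort, you can also use tf.distribute.Strategy with custom training loops.

If you need more flexibility and control over your training loops than is possible with Estimator or Keras, you can write custom training loops. For instance, when using a GAN, you may want to take a different number of generator and discriminator steps each round. Similarly, the high-level frameworks are not very suitable for reinforcement learning.

To support custom training loops, we provide a core set of methods through the tf.distribute.Strategy classes. Using these may require minor restructuring of the code initially, but once that is done, you should be able to switch between GPUs, TPUs, and multiple machines simply by changing the strategy instance.

Here we show a brief snippet illustrating this use case for a simple training example, using the same Keras model as before.

First, create the model and optimizer inside the strategy's scope. This ensures that any variables created with the model and optimizer are mirrored variables.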

with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  optimizer = tf.keras.optimizers.SGD()

Next, we create the input dataset and call tf.distribute.Strategy.experimental_distribute_dataset to distribute the dataset based on the strategy:

dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(1000).batch(
    global_batch_size)
dist_dataset = mirrored_strategy.experimental_distribute_dataset(dataset)

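Then we define one step of training. We use tf.GradientTape to compute gradients and the optimizer to apply those gradients to update the model's variables. To distribute this training step, we put it in a function step_fn and pass it to tf.distribute.Strategy.run, along with the inputs taken from the dist_dataset created before: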

@tf.function
def train_step(dist_inputs):
  def step_fn(inputs):
    features, labels = inputs

    with tf.GradientTape() as tape:
      # training=True is only needed if there are layers with different
      # behavior during training versus inference (e.g. Dropout).
      logits = model(features, training=True)
      cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
          logits=logits, labels=labels)
      loss = tf.reduce_sum(cross_entropy) * (1.0 / global_batch_size)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    return cross_entropy

  per_example_losses = mirrored_strategy.run(step_fn, args=(dist_inputs,))
  mean_loss = mirrored_strategy.reduce(
      tf.distribute.ReduceOp.MEAN, per_example_losses, axis=0)
  return mean_loss

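A few other things to note in the code above:

We use tf.nn.softmax_cross_entropy_with_logits to compute the loss, and then scale the total loss by the global batch size. This is important because all the replicas train in sync, and the number of examples in each training step is the global batch. The loss therefore needs to be divided by the global batch size, not by the replica (local) batch size.
We use the tf.distribute.Strategy.reduce API to aggregate the results returned by tf.distribute.Strategy.run. tf.distribute.Strategy.run returns the results from each local replica in the strategy, and there are multiple ways to consume them. You can reduce them to get an aggregated value, or call tf.distribute.Strategy.experimental_local_results to get the list of values contained in the result, one per local replica.
When apply_gradients is called within a distribution strategy scope, its behavior is modified. Specifically, before applying gradients on each parallel instance during synchronous training, it performs a sum over all replicas of the gradients.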
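
The notes above on loss scaling can be checked with a little arithmetic. The sketch below uses made-up per-example losses and two simulated replicas; because values are summed across replicas, dividing by the global batch size yields the true mean, while dividing by the local batch size overcounts.

```python
per_example_losses = [4.0, 2.0, 6.0, 8.0]  # one global batch of 4 examples
global_batch_size = len(per_example_losses)
# Two simulated replicas, each receiving half of the global batch.
replica_batches = [per_example_losses[:2], per_example_losses[2:]]

# Correct: each replica divides its summed loss by the GLOBAL batch size.
scaled_by_global = [sum(b) / global_batch_size for b in replica_batches]
# Incorrect: each replica divides by its own LOCAL batch size.
scaled_by_local = [sum(b) / len(b) for b in replica_batches]

true_mean = sum(per_example_losses) / global_batch_size
print(true_mean)
print(sum(scaled_by_global))  # matches the true mean after summing replicas
print(sum(scaled_by_local))   # too large by a factor of the replica count
```

The same reasoning applies to gradients, since the gradient of a scaled loss is scaled by the same factor.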

Finally, once we have defined the training step, we can iterate over dist_dataset and run the training in a loop:

with mirrored_strategy.scope():
  for inputs in dist_dataset:
    print(train_step(inputs))

tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor(0.0, shape=(), dtype=float32)
....

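In the example above, we iterated over dist_dataset to provide input to the training. We also provide tf.distribute.Strategy.make_experimental_numpy_dataset to support NumPy inputs. You can use this API to create a dataset before calling tf.distribute.Strategy.experimental_distribute_dataset.

Another way of iterating over your data is to explicitly use iterators. You may want to do this when you want to run for a given number of steps, as opposed to iterating over the entire dataset. The iteration above would now be modified to first create an iterator and then explicitly call next on it to get the input data.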

with mirrored_strategy.scope():
  iterator = iter(dist_dataset)
  for _ in range(10):
    print(train_step(next(iterator)))

tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor(0.0, shape=(), dtype=float32)
....

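This covers the simplest case of using the tf.distribute.Strategy API to distribute custom training loops. We are in the process of improving these APIs. Since this use case requires more work to adapt your code, we will publish a separate, detailed guide in the future.

What's supported now?

In TF 2.0, custom training loops are supported only with MirroredStrategy and TPUStrategy. MultiWorkerMirroredStrategy support will be coming in the future.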

Training API	MirroredStrategy	TPUStrategy	MultiWorkerMirroredStrategy	CentralStorageStrategy	ParameterServerStrategy	OneDeviceStrategy
Custom Training Loop	Experimental support	Experimental support	Support planned post 2.0	Support planned post 2.0	No support yet	Supported

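Using tf.distribute.Strategy with Estimator (limited support)

tf.estimator is a distributed training TensorFlow API that originally supported the async parameter server approach. As with Keras, we have integrated tf.distribute.Strategy into tf.estimator. If you are using Estimator for your training, you can switch to distributed training with very few changes to your code. With this, Estimator users can now do synchronous distributed training on multiple GPUs and multiple workers, as well as use TPUs. This support in Estimator is, however, limited.

The usage of tf.distribute.Strategy with Estimator is slightly different from the Keras case. Instead of using strategy.scope, we pass the strategy object into the RunConfig for the Estimator.

Here is a snippet of code that shows this with a premade Estimator, LinearRegressor, and MirroredStrategy: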

mirrored_strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(
    train_distribute=mirrored_strategy, eval_distribute=mirrored_strategy)
regressor = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column('feats')],
    optimizer='SGD',
    config=config)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpcz2vlwl_
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpcz2vlwl_', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f13a0149e48>, '_device_fn': None, '_protocol': None, '_eval_distribute': <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f13a0149e48>, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': None}

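We use a premade Estimator here, but the same code works with a custom Estimator as well. train_distribute determines how training is distributed, and eval_distribute determines how evaluation is distributed. This is another difference from Keras, where we use the same strategy for both training and evaluation.

Now we can train and evaluate this Estimator with an input function: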

def input_fn():
  dataset = tf.data.Dataset.from_tensors(({"feats":[1.]}, [1.]))
  return dataset.repeat(1000).batch(10)
regressor.train(input_fn=input_fn, steps=10)
regressor.evaluate(input_fn=input_fn, steps=10)

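Another difference to highlight between Estimator and Keras is the input handling. In Keras, we mentioned that each batch of the dataset is split automatically across the multiple replicas. In Estimator, however, the batch is not split automatically, nor is the data automatically sharded across workers. You have full control over how your data is distributed across workers and devices, and you must provide an input_fn to specify how to distribute it.

Your input_fn is called once per worker, thus giving one dataset per worker. One batch from that dataset is then fed to one replica on that worker, so N replicas on one worker consume N batches. In other words, the dataset returned by input_fn should provide batches of size PER_REPLICA_BATCH_SIZE, and the global batch size for a step can be obtained as PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync.

When doing multi-worker training, you should either split your data across the workers, or shuffle with a random seed on each. You can see an example of how to do this in Multi-worker Training with Estimator.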
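
One way to split data across workers is to give each worker every num_workers-th record. The sketch below shows that rule in plain Python with a hypothetical shard helper; tf.data.Dataset.shard(num_shards, index) follows the same selection rule, and inside an input_fn you would call it with the worker count and this worker's index.

```python
def shard(elements, num_workers, worker_index):
    """Keep every num_workers-th element, starting at worker_index."""
    return [e for i, e in enumerate(elements) if i % num_workers == worker_index]


records = list(range(10))  # stand-in for the records of an input dataset
worker_shards = [shard(records, 3, index) for index in range(3)]
for index, part in enumerate(worker_shards):
    print("worker", index, "reads", part)
```

Every record is read by exactly one worker, so the workers together cover the dataset without overlap.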

We showed an example of using MirroredStrategy with Estimator. You can also use TPUStrategy with Estimator in the same way:

config = tf.estimator.RunConfig(
    train_distribute=tpu_strategy, eval_distribute=tpu_strategy)

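Similarly, you can use the multi-worker and parameter server strategies as well. The code remains the same, but you need to use tf.estimator.train_and_evaluate, and set the TF_CONFIG environment variable for each binary running in your cluster.

What's supported now?

In TF 2.0, there is limited support for training with Estimator using all strategies except TPUStrategy. Basic training and evaluation should work, but a number of advanced features such as scaffold do not yet work, and there may be a number of bugs in this integration. We also do not plan to actively improve this support, and are instead focusing on Keras and custom training loop support. If at all possible, you should prefer those APIs over Estimator.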

Training API	MirroredStrategy	TPUStrategy	MultiWorkerMirroredStrategy	CentralStorageStrategy	ParameterServerStrategy	OneDeviceStrategy
Estimator API	Limited Support	Not supported	Limited Support	Limited Support	Limited Support	Limited Support

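Setting up the `TF_CONFIG` environment variable

For multi-worker training, as mentioned before, you need to set the TF_CONFIG environment variable for each binary running in your cluster. The TF_CONFIG environment variable is a JSON string that specifies which tasks constitute the cluster, their addresses, and each task's role in the cluster. We provide a Kubernetes template in the tensorflow/ecosystem repo that sets TF_CONFIG for your training tasks.

An example of TF_CONFIG is: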

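import json
import os

# Describe the cluster and this task's role in it, then export it as TF_CONFIG.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:port", "host2:port", "host3:port"],
        "ps": ["host4:port", "host5:port"]
    },
    "task": {"type": "worker", "index": 1}
})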

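This TF_CONFIG specifies that there are three workers and two ps tasks in the cluster, along with their hosts and ports. The "task" part specifies the role of the current task in the cluster: worker 1 (the second worker). Valid roles in a cluster are "chief", "worker", "ps", and "evaluator". There should be no "ps" job except when using tf.distribute.experimental.ParameterServerStrategy.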
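
To show how the fields of this JSON fit together, here is a small sketch that parses a TF_CONFIG string and reports the current task's role and address. The describe_task helper is hypothetical and uses only the standard library; the host names are the placeholders from the example above.

```python
import json


def describe_task(tf_config_json):
    """Summarize the cluster and the current task from a TF_CONFIG string."""
    config = json.loads(tf_config_json)
    cluster = config["cluster"]
    task_type = config["task"]["type"]
    task_index = config["task"]["index"]
    return {
        "role": task_type,
        "index": task_index,
        # The task's own address is the index-th entry of its job's list.
        "address": cluster[task_type][task_index],
        "num_workers": len(cluster.get("worker", [])),
        "num_ps": len(cluster.get("ps", [])),
    }


sample_tf_config = json.dumps({
    "cluster": {
        "worker": ["host1:port", "host2:port", "host3:port"],
        "ps": ["host4:port", "host5:port"],
    },
    "task": {"type": "worker", "index": 1},
})
summary = describe_task(sample_tf_config)
print(summary)
```

In a real job the string would come from os.environ["TF_CONFIG"], and each task would be started with its own "task" entry.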

What’s next?

tf.distribute.Strategy is actively under development. We welcome you to try it out and provide feedback via GitHub issues.

Original: https://tensorflow.google.cn/guide/distributed_training
