[Deep Learning] Introduction to Distributed TensorFlow 2.0 (Part 2)

舒克与贝克

Category: Deep Learning. Tags: tensorflow, distributed

This is an original article by the blogger 摩登都市天空. Do not repost it without the author's permission.

Article link: https://blog.csdn.net/zwqjoy/article/details/89552866

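[Deep Learning] Introduction to Distributed Training Modes (Part 1)

[Deep Learning] Introduction to Distributed TensorFlow 2.0 (Part 2)

[Deep Learning] Introduction to Distributed PyTorch 1.0 (Part 3)

[Deep Learning] Introduction to Distributed Horovod (Part 4)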

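I. Single-GPU Training vs. Multi-GPU Training

Single-GPU training: the code is usually simple and covers basic needs. The common practice is to set the environment variable CUDA_VISIBLE_DEVICES to a single GPU so that the other GPUs on the machine are masked. If we forget to set it, the program grabs memory on every GPU, but unless the code explicitly assigns work to the other devices, it still computes only on the first card and the rest are occupied without doing any work.

Multi-GPU training raises the ceiling on model training in two ways: 1. models larger than the memory of a single card, and 2. larger batch sizes and faster training.

For single-machine multi-GPU training, the official TensorFlow repository already provides a CIFAR example with fairly detailed code and documentation. The process is outlined here only as a lead-in to multi-machine multi-GPU training.
The single-machine multi-GPU training process:

Suppose the machine has 3 GPUs.
With a single GPU, data is processed one batch at a time. With multiple GPUs, 3 batches are processed at once (assuming 3 GPUs), and each GPU handles one batch.
The variables, i.e. the parameters, are stored on the CPU.
At the start, the CPU distributes the data to the 3 GPUs, which do the computation and produce the gradients for their batches.
The CPU then collects the gradients from the 3 GPUs, averages them, and updates the parameters.
The loop then repeats.

In this process the speed is bounded by the slowest GPU. If the 3 GPUs are about equally fast, the throughput is roughly 3 times that of a single GPU minus the cost of moving data between the CPU and the GPUs. The actual speedup depends on the CPU-GPU transfer bandwidth and on the amount of data processed.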
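The loop described above can be sketched in plain Python with no GPU at all. This is only a toy under stated assumptions: the model is a single parameter w in y = w * x with squared-error loss, and the names grad_on_device and train_step are hypothetical helpers, not TensorFlow APIs.

```python
# Toy simulation of single-machine multi-GPU data parallelism.
# Assumption: a one-parameter model y = w * x with squared-error loss.

def grad_on_device(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w on one device's batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train_step(w, batches, lr):
    """One synchronous step: each simulated GPU gets a batch, the CPU averages."""
    grads = [grad_on_device(w, b) for b in batches]  # computed per device
    avg_grad = sum(grads) / len(grads)               # averaged on the CPU
    return w - lr * avg_grad                         # single parameter update

# Data generated by y = 3x, one batch per simulated GPU.
batches = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
    [(5.0, 15.0), (6.0, 18.0)],
]

w = 0.0
for _ in range(200):
    w = train_step(w, batches, lr=0.01)
print(w)  # approaches 3.0
```

Each step has to wait for all three gradients before the update can happen, which is why the slowest device sets the pace.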

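II. Distributed TensorFlow

TensorFlow supports distributed training mainly through tf.distribute.Strategy.

1 MirroredStrategy: single-machine multi-GPU training

In-graph replication with synchronous training.

MirroredStrategy supports synchronous training across multiple GPUs on a single machine. When training starts, it places a copy of the model on every card.

Each GPU receives data from the tf.data.Dataset and computes gradients independently; the updates are then synchronized with all-reduce. By default the GPUs communicate through NVIDIA NCCL.

Looking into the implementation of MirroredStrategy helps here. Essentially every distribution strategy exchanges data through some collective ops and cross device ops, and MirroredStrategy is no exception. It selects its cross device ops as follows: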

if len(workers) > 1:
  if not isinstance(self._cross_device_ops, cross_device_ops_lib.MultiWorkerAllReduce):
    raise ValueError(
      "In-graph multi-worker training with `MirroredStrategy` is not "
      "supported.")
  self._inferred_cross_device_ops = self._cross_device_ops
else:
  # TODO(yuefengz): make `choose_the_best` work with device strings
  # containing job names.
  self._inferred_cross_device_ops = cross_device_ops_lib.NcclAllReduce()

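This confirms that MirroredStrategy uses NCCL by default in the single-machine multi-GPU case. For the details, see the implementation of AllReduceCrossDeviceOps.

The code above also shows that MirroredStrategy can be applied to multi-machine multi-GPU setups, but there the user has to pass in cross_device_ops_lib.MultiWorkerAllReduce explicitly. MultiWorkerAllReduce supports several communication methods, such as nccl, nccl/xring, nccl/rechd, nccl/pscpu, xring, pscpu and pscpu/pscpu. The best-performing method currently requires NCCL 2.0 together with xring, yet TensorFlow currently uses NCCL 1.1 and nccl/xring does not work in the existing code because of a bug, so this mode is frequently criticized.

Calling tf.distribute.MirroredStrategy() with no arguments creates a MirroredStrategy instance that uses all the GPUs visible to TensorFlow and uses NCCL for cross-device communication.

The training script then runs distributed training automatically. To train on only some of the GPUs on the machine: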

mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

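Users control which distributed architecture is used through this API. For example, to use the All-Reduce architecture in a single-machine multi-GPU environment, define the corresponding Strategy and pass it through the Estimator's config parameter: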

mirrored_strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(
    train_distribute=mirrored_strategy, eval_distribute=mirrored_strategy)
regressor = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column('feats')],
    optimizer='SGD',
    config=config)

tf.keras example

import tensorflow as tf
import tensorflow_datasets as tfds

num_epochs = 5
batch_size_per_replica = 64
learning_rate = 0.001

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: %d' % strategy.num_replicas_in_sync)  # print the number of devices
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync

# Load the dataset and preprocess it
def resize(image, label):
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

# With as_supervised=True, each element is an (image, label) pair
dataset = tfds.load("cats_vs_dogs", split=tfds.Split.TRAIN, as_supervised=True)
dataset = dataset.map(resize).shuffle(1024).batch(batch_size)

with strategy.scope():
    model = tf.keras.applications.MobileNetV2()
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=[tf.keras.metrics.sparse_categorical_accuracy]
    )

model.fit(dataset, epochs=num_epochs)

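The steps of MirroredStrategy are as follows:

Before training starts, the strategy places a complete copy of the model on each of the N compute devices.
Each time a batch of data is fed in, it is split into N parts that go to the N devices (i.e. data parallelism).
Each of the N devices uses its local variables (mirrored variables) to compute the gradients for its part of the data.
A distributed All-reduce operation exchanges the gradients efficiently between devices and sums them, so that every device ends up with the sum of the gradients from all devices.
Each device updates its local variables (mirrored variables) with the summed gradients.
Once all devices have updated their local variables, the next step begins (i.e. the strategy is synchronous).

By default, MirroredStrategy in TensorFlow uses NVIDIA NCCL for the All-reduce operation.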
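To make the All-reduce step concrete, here is a small pure-Python simulation of a ring all-reduce, one common all-reduce algorithm. It is a sketch for intuition only and makes no claim about how NCCL implements the operation internally. For simplicity it assumes each device's gradient list has exactly as many elements as there are devices, so that each element is one chunk; ring_all_reduce is a hypothetical name.

```python
def ring_all_reduce(grads):
    """Sum gradient lists across N simulated devices arranged in a ring.

    Assumption: len(grads[i]) == N, so chunk k is simply element k.
    Returns one list per device; all of them hold the element-wise sum.
    """
    n = len(grads)
    data = [list(g) for g in grads]  # per-device working buffers

    # Phase 1, scatter-reduce: after n-1 steps, device i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the completed chunks around the ring
    # until every device has every chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value
    return data

device_grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
summed = ring_all_reduce(device_grads)
print(summed)
```

Every device sends and receives only one chunk per step, so no single device has to collect everything, which is the property that makes ring-based schemes attractive compared with averaging on one CPU.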

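2 MultiWorkerMirroredStrategy: multi-machine training

For distributed multi-machine environments, Uber was the first to propose a Ring-Allreduce based architecture for distributed TensorFlow, Horovod, and open-sourced it.

tf.distribute.experimental.MultiWorkerMirroredStrategy is very similar to MirroredStrategy: both keep a copy of the model on every device and train synchronously.

The strategy uses CollectiveOps for communication between workers. A collective op is TensorFlow's own all-reduce operation, which automatically picks the best algorithm for the current hardware, network topology and tensor size. The logic of a collective op is very simple: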

if (CanProceedWithCompute(c, col_exec, done)) {
  col_exec->ExecuteAsync(
    c, col_params_, GetCollectiveKey(c), actual_done);
}

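Here c is the computation state of the current op, and col_exec is the collective executor that TensorFlow selects according to the system. All all-reduce, broadcast and receive operations are executed by the collective executor.

The strategy has also implemented many optimizations, such as merging the all-reduce of many small tensors into the all-reduce of a few large tensors, and support for communication over the newer NCCL 2.0, which is under development (see Issue 24505). After being criticized many times, TensorFlow's distributed training is evidently feeling the pressure from PyTorch and Horovod and working hard to improve.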
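The idea of merging many small all-reduce calls into one large call can be illustrated with a toy in plain Python. This is an assumption-laden sketch, not TensorFlow's implementation: tensors are flat lists of floats, and pack, unpack and all_reduce_sum are hypothetical helpers.

```python
def pack(tensors):
    """Concatenate several small 1-D tensors into one flat buffer."""
    flat, sizes = [], []
    for t in tensors:
        flat.extend(t)
        sizes.append(len(t))
    return flat, sizes

def unpack(flat, sizes):
    """Split a flat buffer back into tensors of the recorded sizes."""
    out, start = [], 0
    for n in sizes:
        out.append(flat[start:start + n])
        start += n
    return out

def all_reduce_sum(buffers):
    """Element-wise sum across workers; every worker gets the same result."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

worker_grads = [
    [[1.0], [2.0, 3.0], [4.0]],      # worker 0: three small gradient tensors
    [[10.0], [20.0, 30.0], [40.0]],  # worker 1
]
packed = [pack(g) for g in worker_grads]
reduced = all_reduce_sum([flat for flat, _ in packed])  # one call instead of three
result = [unpack(flat, sizes) for flat, (_, sizes) in zip(reduced, packed)]
print(result[0])
```

The per-call overhead of a collective operation is paid once for the packed buffer rather than once per tensor, which is the motivation for this kind of batching.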

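Finally, two points need attention when configuring MultiWorkerMirroredStrategy.

The first is the choice of collective ops implementation. CollectiveCommunication.RING uses a ring-based communication scheme similar to Horovod's, and CollectiveCommunication.NCCL uses NVIDIA NCCL. The choice can be passed as an argument when the strategy is created: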

multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
  tf.distribute.experimental.CollectiveCommunication.NCCL)

CollectiveCommunication.AUTO defers the choice to the runtime.

The second is the TF_CONFIG setting. This strategy does not need a parameter server, only a list of workers, configured as follows:

TF_CONFIG = {
  'cluster': {
    'worker': ['worker1:port1', 'worker2:port2', 'worker3:port3', ...]
  },
  'task': {'type': 'worker', 'index': 0}
}

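This API is still experimental. If the code selects the All-Reduce architecture through MultiWorkerMirroredStrategy, the cluster in the TF_CONFIG environment variable no longer needs ps nodes at submission time, for example: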

TF_CONFIG='{
    "cluster": {
        "worker": ["host1:2222", "host2:2222", "host3:2222"]
    },
    "task": {"type": "worker", "index": 0}
}'

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
config = tf.estimator.RunConfig(
    train_distribute=strategy, eval_distribute=strategy)
regressor = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column('feats')],
    optimizer='SGD',
    config=config)

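Multi-machine training works much like single-machine multi-GPU training: replace MirroredStrategy with MultiWorkerMirroredStrategy, which is designed for multiple machines. Because several computers have to communicate, some extra setup is needed. Specifically, the environment variable TF_CONFIG must be set, for example: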

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ["localhost:20000", "localhost:20001"]
    },
    'task': {'type': 'worker', 'index': 0}
})

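TF_CONFIG consists of two parts, cluster and task:

cluster describes the structure of the whole cluster and the network address (IP + port) of every machine. Its value is the same on every machine.
task describes the role of the current machine. For example, {'type': 'worker', 'index': 0} means the current machine is worker 0 in the cluster (i.e. localhost:20000). The task value must be set separately on each machine.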
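Since cluster is identical everywhere and only task differs, the per-machine values can be generated from one worker list. The helper below is a minimal sketch using only the standard library; make_tf_config is a hypothetical name, and the addresses are the ones from the example above.

```python
import json

def make_tf_config(worker_addrs, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(worker_addrs):
        raise ValueError("index must refer to an entry in worker_addrs")
    return json.dumps({
        "cluster": {"worker": list(worker_addrs)},    # same on every machine
        "task": {"type": "worker", "index": index},   # differs per machine
    })

workers = ["localhost:20000", "localhost:20001"]
configs = [make_tf_config(workers, i) for i in range(len(workers))]
for cfg in configs:
    print(cfg)
```

On machine i the resulting string would be assigned to os.environ['TF_CONFIG'] before the strategy is created.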

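Once this is set up, run the training code on each machine in turn. A process started early listens until it is connected to the other hosts; when the whole cluster is connected, all machines start training at the same time.

Check the firewall settings on every machine, and in particular open the ports used to communicate with the other hosts. In the example above, worker 0 needs port 20000 open and worker 1 needs port 20001 open.

The following example runs the same training task as the previous section, moved to a multi-machine environment. Suppose there are two machines. Deploy the program below on both; the only difference is the task part, which is {'type': 'worker', 'index': 0} on the first machine and {'type': 'worker', 'index': 1} on the second. Then start the program on both machines in turn; once communication is established, training starts automatically.

tf.keras example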

import tensorflow as tf
import tensorflow_datasets as tfds
import os
import json

num_epochs = 5
batch_size_per_replica = 64
learning_rate = 0.001

num_workers = 2
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ["localhost:20000", "localhost:20001"]
    },
    'task': {'type': 'worker', 'index': 0}
})
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
batch_size = batch_size_per_replica * num_workers

def resize(image, label):
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

dataset = tfds.load("cats_vs_dogs", split=tfds.Split.TRAIN, as_supervised=True)
dataset = dataset.map(resize).shuffle(1024).batch(batch_size)

with strategy.scope():
    model = tf.keras.applications.MobileNetV2()
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=[tf.keras.metrics.sparse_categorical_accuracy]
    )

model.fit(dataset, epochs=num_epochs)

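3 CentralStorageStrategy

tf.distribute.experimental.CentralStorageStrategy also trains synchronously, but variables are not mirrored. They are placed on the CPU, and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations are placed on that GPU.

Create a CentralStorageStrategy instance: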

central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()

INFO:tensorflow:ParameterServerStrategy (CentralStorageStrategy if you are using a single machine) with compute_devices = ['/job:localhost/replica:0/task:0/device:GPU:0'], variable_device = '/job:localhost/replica:0/task:0/device:GPU:0'

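This creates a CentralStorageStrategy instance that uses all visible GPUs and the CPU. Updates to variables on the replicas are aggregated before being applied to the variables.

Note: this strategy is experimental, because it is still being improved to work in more scenarios. Expect changes to this API.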

4 ParameterServerStrategy

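This is TensorFlow's original distributed training method. It consists of a number of parameter servers and a number of worker servers: the parameter servers store the parameters and the workers do the computation.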

ps_strategy = tf.distribute.experimental.ParameterServerStrategy()

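With ParameterServerStrategy, the worker servers fetch parameters from the various parameter servers during training, do the computation, and send the parameter gradients back to the parameter servers. Setting up such an environment is simple: set the TF_CONFIG environment variable when the program starts. Note that every machine in the cluster must be given a different task.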

os.environ["TF_CONFIG"] = json.dumps({
  "cluster": {
    "worker": ["host1:port", "host2:port", "host3:port"],
    "ps": ["host4:port", "host5:port"]
  },
  "task": {"type": "worker", "index": 1}
})

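ParameterServerStrategy also has a rather curious capability: by passing num_gpus_per_worker, a worker can compute synchronously on several GPUs while different workers run asynchronously with respect to each other. However, the GPUs within a worker do not communicate through NCCL and send their results straight to the CPU, so this is very inefficient.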

strategy = tf.distribute.experimental.ParameterServerStrategy()
# Pass the strategy to the Estimator through RunConfig.
run_config = tf.estimator.RunConfig(
    train_distribute=strategy)
# model_fn is required by tf.estimator.Estimator; supply your own.
estimator = tf.estimator.Estimator(model_fn=..., config=run_config)
tf.estimator.train_and_evaluate(estimator,...)

Examples and Tutorials

Here is a list of tutorials and examples that illustrate the above integration end to end with Keras:

Tutorial to train MNIST using MirroredStrategy.
Tutorial to train MNIST using MultiWorkerMirroredStrategy.
DenseNet example using MirroredStrategy.
BERT example trained using MirroredStrategy and TPUStrategy. This example is particularly helpful for understanding how to load from a checkpoint and generate periodic checkpoints during distributed training etc.
NCF example trained using MirroredStrategy and TPUStrategy that can be enabled using the keras_use_ctl flag.
Transformer trained using MirroredStrategy.
NMT example trained using MirroredStrategy.
Official ResNet50 training with ImageNet data using MirroredStrategy.
ResNet50 trained with ImageNet data on Cloud TPUs with TPUStrategy.

We've integrated tf.distribute.Strategy into tf.keras which is TensorFlow's implementation of the Keras API specification. tf.keras is a high-level API to build and train models. By integrating into tf.keras backend, we've made it seamless for Keras users to distribute their training written in the Keras training framework.

The only things that need to change in a user's program are:

(1) Create an instance of the appropriate tf.distribute.Strategy

(2) Move the creation and compiling of Keras model inside strategy.scope.

Here is a snippet of code to do this for a very simple Keras model with one dense layer:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

III. Fault Tolerance in Distributed Training

In synchronous training, if one worker fails and there is no failure-recovery mechanism, the whole training cluster fails.

Using Keras with tf.distribute.Strategy comes with the advantage of fault tolerance in cases where workers die or are otherwise unstable. We do this by preserving training state in the distributed file system of your choice, such that upon restart of the instance that previously failed or preempted, the training state is recovered.

Since all the workers are kept in sync in terms of training epochs and steps, other workers would need to wait for the failed or preempted worker to restart to continue.

ModelCheckpoint callback

To take advantage of fault tolerance in multi-worker training, provide an instance of tf.keras.callbacks.ModelCheckpoint at the tf.keras.Model.fit() call. The callback will store the checkpoint and training state in the directory corresponding to the filepath argument to ModelCheckpoint.

# Replace the `filepath` argument with a path in the file system
# accessible by all workers.
callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath='/tmp/keras-ckpt')]
with strategy.scope():
  multi_worker_model = build_and_compile_cnn_model()
multi_worker_model.fit(x=train_datasets, epochs=3, callbacks=callbacks)

Epoch 1/3
    469/Unknown - 8s 18ms/step - loss: 2.2049 - accuracy: 0.2318
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /tmp/keras-ckpt/assets
469/469 [==============================] - 9s 19ms/step - loss: 2.2049 - accuracy: 0.2318
Epoch 2/3
451/469 [===========================>..] - ETA: 0s - loss: 1.9195 - accuracy: 0.5715
INFO:tensorflow:Assets written to: /tmp/keras-ckpt/assets
469/469 [==============================] - 2s 4ms/step - loss: 1.9113 - accuracy: 0.5767
Epoch 3/3
450/469 [===========================>..] - ETA: 0s - loss: 1.4175 - accuracy: 0.7550
INFO:tensorflow:Assets written to: /tmp/keras-ckpt/assets
469/469 [==============================] - 2s 4ms/step - loss: 1.4078 - accuracy: 0.7561

<tensorflow.python.keras.callbacks.History at 0x7fc38fdfee80>

If a worker gets preempted, the whole cluster pauses until the preempted worker is restarted. Once the worker rejoins the cluster, other workers will also restart. Now, every worker reads the checkpoint file that was previously saved and picks up its former state, thereby allowing the cluster to get back in sync. Then the training continues.

If you inspect the directory containing the filepath you specified in ModelCheckpoint, you may notice some temporarily generated checkpoint files. Those files are needed for recovering the previously lost instances, and they will be removed by the library at the end of tf.keras.Model.fit() upon successful exiting of your multi-worker training.
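The save-then-resume behaviour can be pictured with a toy loop in plain Python. It is a sketch of the general checkpoint/restart pattern, not of what ModelCheckpoint does internally: the JSON file, the train function and the fail_at parameter are all hypothetical, and "training" is just a counter.

```python
import json
import os
import tempfile

def train(ckpt_path, total_epochs, fail_at=None):
    """Toy training loop that checkpoints after every epoch and can resume."""
    state = {"epoch": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):            # resume from the last checkpoint
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["epoch"] < total_epochs:
        if fail_at is not None and state["epoch"] == fail_at:
            raise RuntimeError("simulated preemption")
        state["weight"] += 1.0               # stand-in for one epoch of training
        state["epoch"] += 1
        with open(ckpt_path, "w") as f:      # save the state at the end of the epoch
            json.dump(state, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_epochs=3, fail_at=2)   # dies after finishing 2 epochs
except RuntimeError:
    pass
final = train(ckpt, total_epochs=3)          # restart: continues from epoch 2
print(final)
```

Only the third epoch is run after the restart, because the first two were recovered from the checkpoint file.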

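IV. Summary

This article traced the evolution of the distributed TensorFlow programming model and, from the user's point of view, described the different distributed TensorFlow architectures. As TensorFlow has iterated, it has become steadily easier to use. TensorFlow 2.0.0 has now been officially released, marking the start of the 2.0 era. Different Strategy classes make it easy to switch between distributed architectures, which shows that the API design is more flexible, friendlier and highly extensible. More Strategy classes will likely appear to handle complex distributed scenarios.

The main selling points of 2.0 are Eager Execution and the high-level Keras API, which further improve usability. With Eager Execution, Tensors can be manipulated like native Python objects instead of being evaluated through Session.run as before. The high-level TensorFlow Keras API makes model building more flexible and convenient, and models can be exported to the standard Keras HDF5 format, which is convenient for online serving and similar uses.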

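Supplement: TensorFlow 1.0, in-graph and between-graph

In-graph mode

In-graph mode is somewhat similar to the single-machine multi-GPU model, with the computation extended from one machine with several GPUs to several machines with several GPUs. Data distribution still happens on a single node, and the other compute nodes only need to call join. The advantage is simple configuration: each additional compute node just calls join, exposes a network interface and waits for work. These exposed interfaces are used just like a local GPU: specifying tf.device("/job:worker/task:N") places an operation on a given compute node, the same way an operation is placed on a given GPU. The drawback is that the training data is still distributed from one node to the other machines, which severely limits the speed of concurrent training. This mode is not recommended for training on large data.

With in-graph replication, only one Client is created. The Client builds one Graph that contains a single set of model parameters, placed on the ps, and multiple replicas of the compute part of the model, each placed on a worker, so that the workers can train the replicated model at the same time.

Open another Python interpreter to act as the Client, and execute the following statements to build the computation graph and run it: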

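import tensorflow as tf

num_workers = 2  # the walkthrough cluster has two workers

# One set of parameters, placed on the parameter server.
with tf.device("/job:ps/task:0"):
  w = tf.get_variable("w", initializer=[[1., 2., 3.], [1., 3., 5.]])

input_data = ...
inputs = tf.split(input_data, num_workers)
outputs = []

# One replica of the compute part per worker.
for i in range(num_workers):
  with tf.device("/job:worker/task:%d" % i):
    outputs.append(tf.matmul(inputs[i], w))

output = tf.concat(outputs, axis=0)
# Assumption: connect to the Master Service of worker 0 (localhost:2222 in the walkthrough cluster).
with tf.Session("grpc://localhost:2222") as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(output))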

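As the code above shows, in-graph replication requires the Client to build one graph containing the replicas for all workers. As the number of workers grows the graph becomes very large, which makes it hard to maintain. In addition, data is distributed from the single Client, and pushing the training data to the other machines severely limits concurrent training speed. For these reasons in-graph replication is generally not used for large-scale multi-machine training. It is commonly used for the single-machine multi-GPU case, where it is simple and direct.

Between-graph mode

In between-graph mode the trained parameters are stored on the parameter servers and the data does not need to be distributed: it is stored in shards on the compute nodes. Each compute node does its own computation and then tells the parameter server what to update, and the parameter server updates the parameters. The advantage is that no training data has to be distributed, which saves a great deal of time when the data is at the TB scale, so between-graph mode is recommended for deep learning on large data.

To overcome the scaling limits of in-graph replication, between-graph replication can be used. Here a Client is created on every worker node and all Clients build the same Graph, but the parameters are still placed on the ps. Each worker computes independently, and if one worker dies the system keeps running.

So we continue by executing the following statements in the Python interpreters of the first and second workers, so that each acts as a Client and completes the distributed TensorFlow run: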

with tf.device("/job:ps/task:0"):
  w = tf.get_variable(name='w', shape=[784, 10])
  b = tf.get_variable(name='b', shape=[10])

x = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.int32, shape=[None])
logits = tf.matmul(x, w) + b
loss = ...
train_op = ...

with tf.Session() as sess:
  for _ in range(10000):
    sess.run(train_op, feed_dict=...)

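In the process described above, distribution was driven entirely by hand: first create the Cluster, then build the graph and submit it for execution, and the Master Service and Worker Service on the Server were not used at all. Real applications are not written this way. Usually the snippets above are put into one file and command-line arguments select which part runs, for example: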

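import tensorflow as tf

# Command-line flags that select the role of this process.
flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "Comma-separated list of ps host:port pairs")
flags.DEFINE_string("worker_hosts", "", "Comma-separated list of worker host:port pairs")
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of this task within its job")
FLAGS = flags.FLAGS

ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(
    cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == 'ps':
  # A ps task only serves parameters, so it blocks here.
  server.join()
elif FLAGS.job_name == "worker":
  # Variables go to the ps job, ops stay on this worker.
  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:%d" % FLAGS.task_index,
      cluster=cluster)):
    # Build model...
    loss = ...
    train_op = ...

  # Connect to the Master Service of this worker's own Server.
  with tf.train.MonitoredTrainingSession(
      master=server.target,
      is_chief=(FLAGS.task_index == 0),
      checkpoint_dir="/tmp/train_logs") as mon_sess:
    while not mon_sess.should_stop():
      mon_sess.run(train_op)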

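Every node runs the code above with different arguments. A ps node starts its Server and then blocks, serving parameters. A worker node starts its Server (as a background service) and then acts as the Client: it builds the graph and finally submits it through a Session. Note that before Session.run is called the Client has only built the graph; no computation has started and the Servers are not yet doing anything. Only after Session.run is called are Tasks dispatched to the worker and ps nodes. The Session is given a target argument that names the worker node whose Master Service should be used, and the Client sends the graph it built to that Master Service. Only one Master Service is at work in a TensorFlow cluster; it handles subgraph partitioning, Task dispatch, and model saving and restoring. When partitioning, it automatically places the model parameters on the ps nodes and the gradient computation on the worker nodes. In addition, tf.train.replica_device_setter, used while the Client builds the graph, tells each worker to place Ops on its own machine by default, so every Worker Service builds a separate replica of the compute subgraph when it receives work. Each worker can therefore run independently, and if it dies the other workers keep running.

Although between-graph replication scales well, the code above shows that writing a distributed TensorFlow application requires the user to control how the different components run, which demands a fairly deep understanding of TensorFlow's distributed architecture. Moreover, the distributed and single-machine versions of an application are two separate code bases. Users typically debug the basic logic on one machine and then deploy to a cluster, and before deploying they have to rewrite the single-machine code into a distributed version, which is a very poor experience. In short, with the Low-level distributed programming model one code base cannot run both on a single machine and on a cluster. The barrier to entry is high, and engineers and researchers criticized it for some time. TensorFlow therefore introduced the High-level distributed programming model, which greatly improves usability.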

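Synchronous and asynchronous updates

Both in-graph and between-graph modes support synchronous and asynchronous updates.

With synchronous updates, every gradient update waits until all the distributed data has been processed and the results returned; the gradients are summed and averaged, and only then are the parameters updated. The advantage is that the loss decreases fairly steadily. The drawback is equally clear: the speed is bounded by the slowest shard.

With asynchronous updates, every compute node works on its own and applies its own results to the parameters. The advantage is speed and full use of the compute resources; the drawback is that the loss decreases unsteadily and fluctuates a lot.

Synchronous mode is recommended when the data is small and the nodes have similar compute power. Asynchronous mode is recommended when the data is large and the machines differ widely in performance.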
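The difference in stability can be seen in a toy comparison in plain Python. It is a deliberate simplification: the model is y = w * x with one data point per worker, and asynchrony is modelled by letting every worker compute its gradient from the value of w it read at the start of the round, so later updates are applied with stale gradients. sync_update and async_update are hypothetical names.

```python
def grad(w, x, y):
    """Gradient of (w*x - y)^2 with respect to w."""
    return 2.0 * (w * x - y) * x

def sync_update(w, shards, lr):
    """All workers use the same w; gradients are averaged into one update."""
    g = sum(grad(w, x, y) for x, y in shards) / len(shards)
    return w - lr * g

def async_update(w, shards, lr):
    """Each worker applies its own update, computed from a stale copy of w."""
    stale_w = w
    for x, y in shards:
        w = w - lr * grad(stale_w, x, y)
    return w

shards = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # data from y = 2x, one point per worker

# First round from w = 0: the synchronous step stays below the optimum 2.0,
# the asynchronous round overshoots it.
print(sync_update(0.0, shards, lr=0.05), async_update(0.0, shards, lr=0.05))

w_sync = w_async = 0.0
for _ in range(50):
    w_sync = sync_update(w_sync, shards, lr=0.05)
    w_async = async_update(w_async, shards, lr=0.05)
print(w_sync, w_async)  # both approach 2.0
```

With this learning rate both variants still converge, but the asynchronous one oscillates around the optimum on the way, which mirrors the jitter described above.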

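Distributed training in TensorFlow 1.x

The original distributed TensorFlow

Choosing the number of Parameter Servers is also quite complicated, since different network environments and model sizes affect efficiency, so the official guidance no longer seems to recommend this approach. The original distributed TensorFlow programming is based on the Low-level API. The following example walks through its steps. On one machine we start three Servers (2 workers and 1 ps) to simulate a multi-machine environment: open three Python interpreters (one for each of the 2 workers and the 1 ps) and execute the following Python statements to define a Cluster: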

import tensorflow as tf

cluster = tf.train.ClusterSpec({
  "worker": [
      "localhost:2222",
      "localhost:2223"
  ],
  "ps": [
      "localhost:2224"
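  ]})

Run the following in the first worker's interpreter to start its Server:

server = tf.train.Server(cluster, job_name="worker", task_index=0)

Run the following in the second worker's interpreter to start its Server:

server = tf.train.Server(cluster, job_name="worker", task_index=1)

Run the following in the ps interpreter to start its Server: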

server = tf.train.Server(cluster, job_name="ps", task_index=0)

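At this point a TensorFlow Cluster is running. It consists of two worker nodes and one ps node, and every node has a Master Service and a Worker Service. The Worker Service on a worker node handles gradient computation, and the Worker Service on the ps node handles parameter updates. Of the three Master Services only one is used when needed, for subgraph partitioning and Task dispatch.

As shown in the figure above, suppose there are two tasks:

/job:ps/task:0: stores and updates the model parameters
/job:worker/task:0: runs model training or inference

With a Cluster in place, we can write a Client, build a graph and submit it to the Cluster for execution. The most common strategy in distributed TensorFlow is data parallelism, which places the same model on many devices. TensorFlow calls this Replicated training, and it comes in two modes: in-graph replication and between-graph replication. The Client takes a different form in each mode.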

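Client
The Client can be seen as the TensorFlow front end. It supports programming environments in several languages (Python/C++/Go/Java, etc.) so that users can construct complex computation graphs. The Client builds the graph through the TensorFlow programming interface; at that point TensorFlow has not executed anything. Only when a Session is created does it serve as the bridge between the Client and the back-end runtime, sending the GraphDef in Protobuf format to the Distributed Master. In other words, when the Client evaluates the result of an Op, it triggers the execution of the graph by the Distributed Master.
Master
Starting from the Op to be computed, the Master traverses the graph backwards to find the minimal subgraph it depends on. It then splits that subgraph into several subgraph fragments so that they can run in different processes and on different devices, and finally dispatches the fragments to Workers for execution.
Worker
A Worker follows the dependencies between the nodes of its subgraph and, according to the available hardware (GPU/CPU/TPU), invokes the Kernel implementations of the Ops to do the computation. Every task has a corresponding Worker Service with three main responsibilities: 1. handling requests from the Master; 2. scheduling the Kernel implementations of Ops and executing the local subgraph; 3. coordinating data communication between tasks.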
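The backward traversal performed by the Master can be illustrated on a tiny dependency graph. This is a conceptual sketch in plain Python, not TensorFlow's graph-pruning code: the graph is a dict from op name to the names of its inputs, and minimal_subgraph and the op names are hypothetical.

```python
def minimal_subgraph(graph, fetch):
    """Return the set of ops that `fetch` depends on, including itself.

    `graph` maps each op name to the list of op names it takes as input.
    """
    needed, stack = set(), [fetch]
    while stack:
        op = stack.pop()
        if op not in needed:
            needed.add(op)
            stack.extend(graph[op])  # walk backwards along the inputs
    return needed

graph = {
    "x": [],
    "w": [],
    "b": [],
    "matmul": ["x", "w"],
    "logits": ["matmul", "b"],
    "unused_summary": ["w"],  # not needed to compute logits
}
pruned = sorted(minimal_subgraph(graph, "logits"))
print(pruned)
```

Ops that the requested result does not depend on, such as unused_summary here, are left out and never dispatched.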

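In distributed TensorFlow, all the nodes or devices that take part in the distributed system are collectively called a Cluster. A Cluster contains many Servers, each Server executes one Task, and Servers and Tasks correspond one to one.

A Cluster can therefore be seen as a set of Servers or as a set of Tasks. TensorFlow adds one more layer of abstraction over Tasks and calls a set of similar Tasks a Job.

A set of Tasks (i.e. a Job) has several Servers, each identified by host and port. Two Services are bound to every Server, the Master Service and the Worker Service mentioned earlier. The Client connects through a Session to the Master Service of any Server in the cluster and submits the graph. The Master Service partitions the graph and dispatches Tasks to the Worker Services, and each Worker Service runs the Tasks dispatched to it to complete the computation of its subgraph.

Why split into Cluster, Job and Task?

First, the Task: a Task is a process on a host. In most cases one machine runs only one Task.

Why is a Job a set of Tasks? In a distributed deep learning framework, Jobs are usually divided into Parameter and Worker. The Parameter Job manages the storage and updating of parameters, and the Worker Job runs the ops. If there are too many parameters for one machine to handle, multiple Tasks are needed.

A Cluster is a set of Jobs: the Cluster is the cluster system we run on.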
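The Cluster, Job and Task hierarchy maps directly onto a nested dictionary. The sketch below, in plain Python, enumerates the tasks of the three-Server cluster used in the walkthrough and prints the device-style name of each one; list_tasks is a hypothetical helper.

```python
def list_tasks(cluster):
    """Return (task name, address) pairs for every Task in the Cluster.

    `cluster` maps a Job name to the list of host:port addresses of its Tasks.
    """
    tasks = []
    for job in sorted(cluster):                      # a Cluster is a set of Jobs
        for index, address in enumerate(cluster[job]):  # a Job is a set of Tasks
            tasks.append(("/job:%s/task:%d" % (job, index), address))
    return tasks

cluster = {
    "worker": ["localhost:2222", "localhost:2223"],
    "ps": ["localhost:2224"],
}
for name, address in list_tasks(cluster):
    print(name, address)
```

Each printed pair corresponds to one Server process, which is the one-to-one relation between Servers and Tasks described above.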

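Parameter server

As models grow and the number of parameters increases to the point where one machine cannot keep up with updating them, the parameters have to be split across machines for storage and updating. A parameter server can be a cluster of several machines, similar to a distributed storage system. Its main purpose is to solve the performance problems of parameter storage and updating.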
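One simple way to spread parameters over several parameter server tasks is round-robin assignment. The sketch below shows that policy in plain Python as an illustration only; assign_variables and the variable names are hypothetical, and real placement policies may also balance by variable size.

```python
def assign_variables(var_names, num_ps):
    """Assign each variable to a ps task in round-robin order."""
    if num_ps < 1:
        raise ValueError("need at least one ps task")
    return {
        name: "/job:ps/task:%d" % (i % num_ps)
        for i, name in enumerate(var_names)
    }

placement = assign_variables(["w1", "b1", "w2", "b2", "w3"], num_ps=2)
for name, device in placement.items():
    print(name, "->", device)
```

With two ps tasks the five variables are split three and two, so neither task has to store or update all of the parameters.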

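In the PS architecture, the set of Parameter Server Tasks is ps (i.e. the job type is ps), and the set of Tasks that compute gradients is worker (i.e. the job type is worker). This is the layout used by the Low-level distributed programming model.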

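High-level distributed programming model

TensorFlow provides the high-level Estimator and Dataset APIs, which simplify model construction and data input. Users who write applications with them do not need to know TensorFlow's internals and can focus on the model itself.

An Estimator represents a complete model and provides methods for training, evaluation, prediction and export.

Estimator has the following advantages:

Code written with Estimator runs in both single-machine and distributed environments without special treatment.
It simplifies sharing and deployment between model developers: its standard model export lets a trained model be used directly by online services such as TensorFlow-Serving.
It provides full lifecycle management for distributed training: it initializes variables automatically, handles exceptions, creates checkpoint files and recovers from failures, and saves summaries for TensorBoard.
It provides a set of ready-to-use Estimators, such as DNNClassifier and LinearClassifier.

When writing an application with Estimator, data input must be separated from the model. Input pipelines can be built with the Dataset API, which, like a Spark RDD or DataFrame, easily handles large data, different data formats and complex transformations. For details on using Estimator, see the official TensorFlow documentation, which is very thorough.

An application written with Estimator runs on a single machine as is. To deploy it to a distributed environment, set the TF_CONFIG environment variable for the cluster on every node before the code runs (in practice this is usually done automatically by a resource scheduling platform such as K8S, with no changes to the TensorFlow application code):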

TF_CONFIG='{
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222", "host3:2222"],
        "ps": ["host4:2222", "host5:2222"]
    },
    "task": {"type": "chief", "index": 0}
}'

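The TF_CONFIG environment variable is a JSON string that specifies the cluster layout (cluster) and the node's own role (task). The cluster consists of chief, worker and ps nodes. The chief is really a special worker; there can be only one, and it is the node that hosts the distributed TensorFlow Master Service.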
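To see how a node can work out its own role from this string, here is a minimal parsing sketch using only the standard library. describe_task is a hypothetical helper, and the JSON is the example value shown above.

```python
import json

def describe_task(tf_config_json):
    """Return (job type, task index, own address) for a TF_CONFIG string."""
    cfg = json.loads(tf_config_json)
    task = cfg["task"]
    # Look up this node's own address in the cluster section.
    address = cfg["cluster"][task["type"]][task["index"]]
    return task["type"], task["index"], address

tf_config = """{
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222", "host3:2222"],
        "ps": ["host4:2222", "host5:2222"]
    },
    "task": {"type": "chief", "index": 0}
}"""
role = describe_task(tf_config)
print(role)
```

In a real deployment the string would be read from os.environ["TF_CONFIG"], and only the task part would differ from node to node.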

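As described above, writing distributed TensorFlow applications with the high-level API is already convenient. Because of the PS architecture, however, deployment still requires deciding how many ps and how many workers to use, and these numbers have to be tuned repeatedly during debugging. With large models the ps can become a network bottleneck during distributed training, because every worker updates and fetches parameters through the ps. If the network of a ps node is saturated, workers may block waiting and their compute capacity goes unused. TensorFlow therefore later introduced the All-Reduce architecture to address this kind of problem.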
