Distributed TensorFlow: the official Distributed TensorFlow documentation

Latest recommended article published on 2021-02-03 23:51:26

Inc_Cool

Distributed TensorFlow

If you repost this article, please credit Inc_Cool: http://blog.csdn.net/qq_25073253/article/details/53033978

Official source: the Distributed TensorFlow guide in the official English documentation.

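This document shows how to create a cluster of TensorFlow servers and how to distribute a computation graph across that cluster. We assume you are familiar with the basic concepts of writing TensorFlow programs.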

1.Hello distributed TensorFlow!

To see a simple TensorFlow cluster in action, execute the following:

# Start a TensorFlow server as a single-process "cluster".
$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> server = tf.train.Server.create_local_server()
>>> sess = tf.Session(server.target)  # Create a session on the server.
>>> sess.run(c)
'Hello, distributed TensorFlow!'

The tf.train.Server.create_local_server() method creates a single-process cluster with an in-process server.

2.Create a cluster

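A TensorFlow "cluster" is a set of "tasks" that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow "server", which contains a "master" that can be used to create sessions and a "worker" that executes operations in the graph. A cluster can also be divided into one or more "jobs", where each job contains one or more tasks.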

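To create a cluster, start one TensorFlow server per task in the cluster. Each task typically runs on a different machine, but you can run multiple tasks on the same machine (for example, to control different GPU devices). In each task, do the following:
Create a tf.train.ClusterSpec that describes all of the tasks in the cluster. This should be the same for each task.
Create a tf.train.Server, passing the tf.train.ClusterSpec to the constructor and identifying the local task with a job name and task index.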

3.Create a tf.train.ClusterSpec to describe the cluster

The cluster specification dictionary maps job names to lists of network addresses. Pass this dictionary to the tf.train.ClusterSpec constructor.
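As a rough illustration of the mapping between job names and addresses, the sketch below builds a cluster specification dictionary in plain Python and lists the tasks it implies. The host names and the `list_tasks` helper are hypothetical, chosen only for this example. In a real program the dictionary would be passed to tf.train.ClusterSpec; TensorFlow is deliberately not imported here so the sketch runs on its own.

```python
# Hypothetical host names, for illustration only.
cluster_dict = {
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
    ],
}


def list_tasks(cluster):
    """Return (device name, address) pairs for every task in the cluster.

    The position of an address in its job's list is the task index.
    """
    tasks = []
    for job_name in sorted(cluster):
        for task_index, address in enumerate(cluster[job_name]):
            tasks.append(("/job:%s/task:%d" % (job_name, task_index), address))
    return tasks


for device, address in list_tasks(cluster_dict):
    print(device, "->", address)
```

Each entry in a job's list becomes one task, so the dictionary above describes five tasks: two in the ps job and three in the worker job.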

4.Create a tf.train.Server instance in each task

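A tf.train.Server object contains a set of local devices, a set of connections to the other tasks in its tf.train.ClusterSpec, and a "session target" that can use these to perform a distributed computation. Each server is a member of a specific named job and has a task index within that job. A server can communicate with any other server in the cluster.
For example, to launch a cluster with two servers running on localhost:2222 and localhost:2223, run the following snippets in two different processes on the local machine: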

# In task 0:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
# In task 1:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)

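Note: Manually specifying these cluster specifications can be tedious, especially for large clusters. We are working on tools for launching tasks programmatically, for example using a cluster manager such as Kubernetes. If there is a particular cluster manager you would like to see supported, please raise a GitHub issue.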

5.Specifying distributed devices in your model

To place operations on a particular process, you can use the same tf.device() function that is used to specify whether ops run on the CPU or GPU. For example:

with tf.device("/job:ps/task:0"):
  weights_1 = tf.Variable(...)
  biases_1 = tf.Variable(...)
with tf.device("/job:ps/task:1"):
  weights_2 = tf.Variable(...)
  biases_2 = tf.Variable(...)
with tf.device("/job:worker/task:7"):
  input, labels = ...
  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)  # ...
  train_op = ...
with tf.Session("grpc://worker7.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)

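In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).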
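Pinning each variable by hand, as in the example above, becomes repetitive as the number of variables grows. The following plain-Python sketch shows one possible way to spread variables over ps tasks in round-robin order. The `assign_round_robin` helper and the variable names are hypothetical, and the sketch only produces device strings; it does not call TensorFlow.

```python
def assign_round_robin(variable_names, num_ps_tasks):
    """Map each variable name to a ps device string in round-robin order."""
    if num_ps_tasks < 1:
        raise ValueError("need at least one ps task")
    placement = {}
    for i, name in enumerate(variable_names):
        placement[name] = "/job:ps/task:%d" % (i % num_ps_tasks)
    return placement


variables = ["weights_1", "biases_1", "weights_2", "biases_2"]
placement = assign_round_robin(variables, num_ps_tasks=2)
for name in variables:
    print(name, "->", placement[name])
```

Note that this interleaves the variables across the two tasks, unlike the hand-written example, which keeps each layer's weights and biases on the same task. Which layout is preferable depends on the model and is not addressed here.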

6.Replicated training

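A common training configuration, called "data parallelism", involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps job. All tasks typically run on different machines. There are many ways to specify this structure in TensorFlow, and we are building libraries that will simplify the work of specifying a replicated model. Possible approaches include: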

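In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps) and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker.
Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before, using tf.train.replica_device_setter() to map them deterministically to the same tasks) and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.
Asynchronous training. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.
Synchronous training. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer) and with between-graph replication (e.g. using tf.train.SyncReplicasOptimizer).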
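To make the difference between asynchronous and synchronous updates concrete, here is a toy simulation in plain Python with no TensorFlow involved. It minimizes f(w) = w ** 2 over a single shared parameter with two simulated replicas. The function names, learning rate, and step count are assumptions chosen for the illustration, and every replica computes the same gradient because there is no real data. In the synchronous step all replicas read the same parameter value and the averaged gradient is applied once; in the asynchronous step each replica reads the current value and applies its own update in turn.

```python
def gradient(w):
    """Gradient of the toy loss f(w) = w ** 2."""
    return 2.0 * w


def sync_step(w, num_replicas, lr):
    """All replicas read the same w; gradients are averaged and applied once."""
    grads = [gradient(w) for _ in range(num_replicas)]
    return w - lr * sum(grads) / num_replicas


def async_step(w, num_replicas, lr):
    """Each replica reads the latest w and applies its own update in turn."""
    for _ in range(num_replicas):
        w = w - lr * gradient(w)
    return w


w_sync = w_async = 1.0
for _ in range(3):
    w_sync = sync_step(w_sync, num_replicas=2, lr=0.25)
    w_async = async_step(w_async, num_replicas=2, lr=0.25)

print("sync: ", w_sync)
print("async:", w_async)
```

The synchronous variant makes one parameter update per round, while the asynchronous variant makes one update per replica. Real asynchronous training may also involve replicas computing gradients from stale parameter values, which this sequential toy does not model.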

7.Putting it all together: example trainer program

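The following code shows the skeleton of a distributed trainer program, implementing between-graph replication and asynchronous training. It includes the code for the parameter server and worker tasks.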

import tensorflow as tf
# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")
# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    # Parameter servers only host variables, so block here forever.
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.summary.merge_all()
      init_op = tf.global_variables_initializer()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization, restoring from
    # a checkpoint, and closing when done or an error occurs.
    with sv.managed_session(server.target) as sess:
      # Loop until the supervisor shuts down or 1000000 steps have completed.
      step = 0
      while not sv.should_stop() and step < 1000000:
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        _, step = sess.run([train_op, global_step])

    # Ask for all the services to stop.
    sv.stop()


if __name__ == "__main__":
  tf.app.run()
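One detail of the flag handling is worth checking before running the skeleton: splitting an empty string on commas yields a list containing one empty string, not an empty list. The sketch below demonstrates this in plain Python and shows one possible way to guard against it. The `parse_hosts` helper is hypothetical and is not part of the trainer above.

```python
def parse_hosts(flag_value):
    """Split a comma-separated host list, dropping blanks and whitespace."""
    return [host.strip() for host in flag_value.split(",") if host.strip()]


# What a plain split returns when the flag is left at its default of "".
print("".split(","))

# The helper returns an empty list in that case and tolerates stray spaces.
print(parse_hosts(""))
print(parse_hosts("ps0.example.com:2222, ps1.example.com:2222"))
```

Whether an empty job should be rejected or silently dropped is a design choice left to the caller.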

To start the trainer with two parameter servers and two workers, use the following command lines (assuming the script is called trainer.py):


# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1
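
Because the four commands differ only in --job_name and --task_index, they can be generated rather than typed by hand. The sketch below is one way to do that in plain Python. The `launch_commands` helper is hypothetical; it only prints the commands and does not start any processes.

```python
def launch_commands(script, ps_hosts, worker_hosts):
    """Return (host, command) pairs, one per task in the cluster."""
    common = "python %s --ps_hosts=%s --worker_hosts=%s" % (
        script, ",".join(ps_hosts), ",".join(worker_hosts))
    commands = []
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index, address in enumerate(hosts):
            host = address.split(":")[0]  # drop the port to get the machine
            commands.append((host, "%s --job_name=%s --task_index=%d"
                             % (common, job_name, task_index)))
    return commands


ps_hosts = ["ps0.example.com:2222", "ps1.example.com:2222"]
worker_hosts = ["worker0.example.com:2222", "worker1.example.com:2222"]
for host, command in launch_commands("trainer.py", ps_hosts, worker_hosts):
    print("# On %s:" % host)
    print("$ " + command)
```

Every task receives the same --ps_hosts and --worker_hosts values, which keeps the cluster description identical across tasks.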