Distributed TensorFlow

pyStar_公众号

Categories: AI, tensorflow. Tags: tensorflow, distributed deployment

Article link: https://blog.csdn.net/qq_42413820/article/details/80904212

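Training a neural network on a large dataset usually demands substantial compute and a long time. TensorFlow therefore provides a distributed mode that splits one training job into several smaller tasks and assigns them to different machines to compute cooperatively, which can greatly reduce training time.

Let us first look at the simpler training setups:
1) Single CPU, single GPU

This is the simplest case. Both the parameters and the computation can be defined on the GPU. If the model's parameters are large and GPU memory is insufficient, the parameters have to be placed on the CPU instead.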

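import tensorflow as tf

# Define the parameters on the CPU (they could also be placed on the GPU).
with tf.device('/cpu:0'):
    w = tf.get_variable('w', [1], tf.float32, initializer=tf.constant_initializer(2))
    b = tf.get_variable('b', [1], tf.float32, initializer=tf.constant_initializer(5))
# Define the computation on the GPU.
with tf.device('/gpu:0'):
    add = w + b
    mut = w * b
# tf.global_variables_initializer() replaces the deprecated tf.initialize_all_variables().
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    tensor1, tensor2 = sess.run([add, mut])
    print(tensor1)
    print(tensor2)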

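2) Single CPU, multiple GPUs

Here different GPUs can be assigned different parts of the training. Shared operations are usually defined on the CPU and parallel operations on the individual GPUs. In deep learning, for example, parameter definitions and gradient-based parameter updates are typically placed on the CPU. Each GPU computes the gradients for its own batch of data and sends them to the CPU, which averages them and updates the parameters.

For a complete multi-GPU deep learning training script, see: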

https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py

import tensorflow as tf

# Shared parameters live on the CPU.
with tf.device('/cpu:0'):
    w = tf.get_variable('w', [1], tf.float32, initializer=tf.constant_initializer(2))
    b = tf.get_variable('b', [2], tf.float32, initializer=tf.constant_initializer(5))
# Each GPU runs its own operation.
with tf.device('/gpu:0'):
    add = w + b
with tf.device('/gpu:1'):
    mut = w * b
# tf.global_variables_initializer() replaces the deprecated tf.initialize_all_variables().
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([add, mut]))
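The gradient averaging described for the multi-GPU case can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the code from the linked inception script: it assumes a toy model y = w * x + b with a mean squared error loss, and the names tower_gradients, average_gradients, and train are made up for this illustration. Each "device" computes gradients on its own batch, and the averaged gradient is used for a single shared update.

```python
# Pure-Python sketch (no TensorFlow) of averaging per-device gradients.
# Model: y = w * x + b, loss: mean squared error. All names are illustrative.

def tower_gradients(w, b, batch):
    """Return (dL/dw, dL/db) for one device's batch of (x, y) pairs."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def average_gradients(tower_grads):
    """Average the gradients collected from all devices (the CPU's job)."""
    n = len(tower_grads)
    avg_w = sum(g[0] for g in tower_grads) / n
    avg_b = sum(g[1] for g in tower_grads) / n
    return avg_w, avg_b


def train(batches, steps=200, learning_rate=0.1):
    """Run gradient descent where every step averages the per-device gradients."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grads = [tower_gradients(w, b, batch) for batch in batches]
        grad_w, grad_b = average_gradients(grads)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Two "devices", each holding one batch sampled from y = 2x.
batches = [[(-1.0, -2.0), (-0.5, -1.0)], [(0.5, 1.0), (1.0, 2.0)]]
w, b = train(batches)
print(w, b)
```

Because the two batches have the same size, the averaged gradient equals the gradient over all four samples, so the result matches single-device training on the full data.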

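3) Multiple CPUs, multiple GPUs

In this setup each process is given a role, so that the roles can cooperate with a clear division of labor.

The concepts of cluster, job, and task:

A task can be seen as one process on a machine; several tasks make up a job.

Jobs come in two kinds, ps (parameter server) and worker, which serve parameters and perform computation respectively. Together the jobs make up the cluster.

Distributed TensorFlow has two architectural modes: in-graph and between-graph.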

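In-graph mode

In-graph mode extends computation from a single machine with multiple GPUs to multiple machines with multiple GPUs, but data distribution still happens on one node. The advantage is simple configuration: each of the other compute nodes only needs to call join, expose a network interface, and wait for work. Such a node is used much like a local GPU: specifying tf.device("/job:worker/task:n") places an operation on that compute node, just as one would pin an operation to a particular GPU. The drawback is that the training data is still distributed from a single node to the other machines, which severely limits parallel training speed. This mode is not recommended for training on large datasets.

Between-graph mode

In between-graph mode the trained parameters live on the parameter servers and the data does not need to be distributed: it is stored in shards on the compute nodes. Each compute node works on its own shard and, when done, sends its parameter updates to the parameter server, which applies them. The advantage is that no training data has to be distributed, which saves a great deal of time when the data reaches the TB scale. For deep learning on large datasets, between-graph mode is therefore the recommended choice.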
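One simple way to picture the data sharding in between-graph mode is a strided split by task index. This is only a sketch under that assumption: the article does not say how shards are produced, and shard_for_worker is a hypothetical helper, not a TensorFlow function.

```python
def shard_for_worker(samples, task_index, num_workers):
    """Return the samples that worker `task_index` trains on.

    A strided slice gives every worker a disjoint subset, so no node
    has to send training data to another node.
    """
    return samples[task_index::num_workers]


samples = list(range(10))
num_workers = 3
shards = [shard_for_worker(samples, i, num_workers) for i in range(num_workers)]
for i, shard in enumerate(shards):
    print("worker", i, "->", shard)
```

The shards are disjoint and together cover the whole dataset, which is the property the between-graph description relies on.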

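Synchronous and asynchronous updates

Both TensorFlow modes support synchronous and asynchronous updates.

Synchronous update: the data is split into several parts, and each part computes its own gradients from the current parameters. Once every part has finished, the gradients are gathered and combined into a total gradient, which is then used to update the parameters.
Asynchronous update: in synchronous mode, every update has to wait until all parts have finished computing their gradients, so throughput is bounded by the slowest part and the others waste time waiting. In asynchronous mode, each part computes only its own gradients and updates the parameters with them directly; the parts neither communicate with nor wait for one another.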
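The difference between the two update rules can be sketched in plain Python. This is a simplified illustration rather than TensorFlow's implementation: it assumes a one-parameter model y = w * x, the function names are made up, and the asynchronous case is simulated by letting the workers apply their updates one after another in a fixed order (real asynchronous training has no fixed order).

```python
def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, learning_rate):
    """Synchronous: all gradients use the same w and are averaged into one update."""
    grads = [gradient(w, shard) for shard in shards]
    return w - learning_rate * sum(grads) / len(grads)


def async_step(w, shards, learning_rate):
    """Asynchronous (simulated): each worker applies its own gradient immediately."""
    for shard in shards:
        w = w - learning_rate * gradient(w, shard)
    return w


# Two workers, each with its own shard sampled from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(-1.0, -2.0), (0.5, 1.0)]]

w_sync = w_async = 0.0
for _ in range(100):
    w_sync = sync_step(w_sync, shards, 0.1)
    w_async = async_step(w_async, shards, 0.1)
print(w_sync, w_async)
```

In this toy setting both rules approach w = 2. The synchronous rule performs one update per round, while the asynchronous rule performs one update per worker per round without waiting for the others.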

The code below is used to explain the functions involved and how they are used:

import numpy as np
import tensorflow as tf

flags = tf.app.flags
# Role name
flags.DEFINE_string('job_name', None, 'job name: worker or ps')
# Index of the task within its job
flags.DEFINE_integer('task_index', None, 'Index of task within the job')
# Host addresses and ports
flags.DEFINE_string('ps_hosts', 'localhost:1681', 'Comma-separated list of hostname:port pairs')
flags.DEFINE_string('worker_hosts', 'localhost:1682,localhost:1683', 'Comma-separated list of hostname:port pairs')
# Directory for saved files
flags.DEFINE_string('log_dir', 'log/super/', 'directory path')
# Training settings
flags.DEFINE_integer('training_epochs', 20, 'training epochs')
FLAGS = flags.FLAGS

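The code above is straightforward: it only defines some flags.

1) job_name and task_index are passed on the command line at run time and determine the role (mainly ps or worker) and the task index.

2) ps_hosts and worker_hosts give the IP addresses and ports of the hosts taking part in training, separated by commas.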

# Generate simulated data
train_X = np.linspace(-1, 1, 100)
train_Y = 2 * train_X + np.random.randn(*train_X.shape) * 0.3  # y = 2x with added noise

tf.reset_default_graph()

ps_hosts = FLAGS.ps_hosts.split(',')
worker_hosts = FLAGS.worker_hosts.split(',')
cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})

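This code does two things:

1) It generates simulated data.

2) It splits ps_hosts and worker_hosts and passes them to tf.train.ClusterSpec(), which collects the IP address and port of every ps and worker node that will run this job. Every node executes this code, so each one knows which members the cluster has and whether each member is a ps or a worker.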
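To make the cluster description concrete, here is a pure-Python sketch of the same bookkeeping, without TensorFlow. The helper names build_cluster, task_address, and device_name are hypothetical; only the dictionary layout and the "/job:.../task:N" device string come from the article.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Split the comma-separated flag values into a job -> address-list mapping."""
    return {"ps": ps_hosts.split(","), "worker": worker_hosts.split(",")}


def task_address(cluster, job_name, task_index):
    """A task is identified by its job and its position in that job's list."""
    return cluster[job_name][task_index]


def device_name(job_name, task_index):
    """Device string in the form used with tf.device()."""
    return "/job:%s/task:%d" % (job_name, task_index)


cluster = build_cluster("localhost:1681", "localhost:1682,localhost:1683")
print(cluster)
print(task_address(cluster, "worker", 1), device_name("worker", 1))
```

With the default flag values from the article, the second worker listens on localhost:1683 and is addressed as /job:worker/task:1.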

# Create the server for this task
server = tf.train.Server(cluster_spec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
# A ps task blocks in join() and waits
if FLAGS.job_name == 'ps':
    print("waiting...")
    server.join()

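tf.train.Server() assigns this process its role according to its arguments, which determine which task it is. If the job name is ps, the program blocks at join() and waits for other hosts to connect; it then serves parameter updates, receiving the parameters and update data that the worker nodes submit. If the job name is worker, the program goes on to run the computation that follows.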

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster_spec)):
    X = tf.placeholder("float")
    Y = tf.placeholder("float")
    # Model parameters
    W = tf.Variable(tf.random_normal([1]), name="weight")
    b = tf.Variable(tf.zeros([1]), name="bias")

    # Training step counter; tf.train.get_or_create_global_step() replaces the deprecated tf.contrib version
    global_step = tf.train.get_or_create_global_step()

    # Forward pass
    z = tf.multiply(X, W) + b
    # Loss for back-propagation
    cost = tf.reduce_mean(tf.square(Y - z))
    learning_rate = 0.01
    # Gradient descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, global_step=global_step)

    init = tf.global_variables_initializer()

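The device function passed to tf.device() is produced by tf.train.replica_device_setter().

The parameters of tf.train.replica_device_setter() are listed in its docstring at the end of this article.

worker_device names the specific worker task;
cluster specifies the roles and their IP addresses, so that the graph nodes of the whole job can be managed.

init = tf.global_variables_initializer() initializes all variables defined before it; any variable defined afterwards will not be initialized by it.

Variables defined under this with statement are automatically placed on the parameter servers; if there are several parameter servers, they are assigned in round-robin order.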
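The round-robin placement can be sketched as a small device-choosing function. This is a simplified illustration and not TensorFlow's implementation: it treats only ops of type "Variable" as parameter ops (the real default set, STANDARD_PS_OPS, contains more op types), and make_device_chooser is a hypothetical name.

```python
import itertools


def make_device_chooser(num_ps_tasks, worker_device):
    """Return a function that maps an op type to a device string.

    Variables are spread over the ps tasks in round-robin order;
    every other op stays on the worker's own device.
    """
    ps_cycle = itertools.cycle(range(num_ps_tasks)) if num_ps_tasks > 0 else None

    def choose(op_type):
        if op_type == "Variable" and ps_cycle is not None:
            return "/job:ps/task:%d" % next(ps_cycle)
        return worker_device

    return choose


choose = make_device_chooser(num_ps_tasks=2, worker_device="/job:worker/task:0")
placements = [choose("Variable"), choose("MatMul"), choose("Variable"), choose("Variable")]
print(placements)
```

With two ps tasks, the first, second, and third variables land on ps task 0, task 1, and task 0 again, while the MatMul op stays on the worker.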

sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         init_op=init,
                         global_step=global_step)

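tf.train.Supervisor() acts like a supervisor. In distributed training many machines run at once, and tasks such as initializing parameters, saving the model, and writing summaries all need handling. The Supervisor takes care of these so that they do not have to be done by hand. Sharing parameters in a distributed environment is also fiddly, which is another reason tf.train.Supervisor() exists.

      is_chief indicates whether this process is the chief supervisor. Here the worker with task_index=0 is made the chief supervisor; it is responsible for initializing the parameters and for saving the model and the summaries.
      init_op is the op used to initialize the variables.
      global_step can be shared by all compute nodes and is incremented by 1 automatically each time the optimizer's minimize op runs.

Because the parameters are already initialized through init_op, there is no need to run sess.run(init) again; doing so would wipe the variables that were loaded from a saved model.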

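Some other parameters:

logdir is the path where checkpoint files and summary files are saved. When training starts, the Supervisor looks in logdir for a checkpoint file; if one exists it is loaded automatically, otherwise the variables are initialized with init_op.
saver takes the saver object used for checkpoints, and the Supervisor then saves checkpoint files automatically. Set it to None to disable automatic saving.
summary_op likewise makes the Supervisor save summaries automatically. Set it to None to disable automatic saving.
save_model_secs is the time interval between checkpoint saves.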
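The "load the checkpoint if one exists, otherwise initialize" behaviour attributed to logdir can be sketched with the standard library. This only illustrates the decision logic and says nothing about how tf.train.Supervisor stores checkpoints: the JSON file, the name checkpoint.json, and both helper functions are assumptions made for this example.

```python
import json
import os
import tempfile

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name for this sketch


def save_checkpoint(logdir, params):
    """Write the parameters into logdir."""
    os.makedirs(logdir, exist_ok=True)
    with open(os.path.join(logdir, CHECKPOINT_NAME), "w") as f:
        json.dump(params, f)


def restore_or_init(logdir, init_fn):
    """Return (params, restored): load a checkpoint if present, else call init_fn."""
    path = os.path.join(logdir, CHECKPOINT_NAME)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True
    return init_fn(), False


with tempfile.TemporaryDirectory() as logdir:
    # First start: nothing saved yet, so the initializer runs.
    print(restore_or_init(logdir, lambda: {"W": 0.0, "b": 0.0}))
    save_checkpoint(logdir, {"W": 1.9, "b": 0.1})
    # Restart: the saved values are loaded and the initializer is skipped.
    print(restore_or_init(logdir, lambda: {"W": 0.0, "b": 0.0}))
```

This also shows why re-running the initializer after a restore is harmful: it would overwrite the loaded values with the initial ones.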

# Connect to the target server and create a session
with sv.managed_session(server.target) as sess:
    step = sess.run(global_step)
    print(step)

    # global_step is shared by all workers; train until it reaches the total step count
    total_steps = FLAGS.training_epochs * len(train_X)
    while step < total_steps:
        for (x, y) in zip(train_X, train_Y):
            _, step = sess.run([optimizer, global_step], feed_dict={X: x, Y: y})

            loss = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print("Step:", step, "cost=", loss, "W=", sess.run(W), "b=", sess.run(b))

    print(" Finished!")
sv.stop()

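The code above opens and manages a session through managed_session of tf.train.Supervisor(). The session is only responsible for computation; communication and coordination are left to the Supervisor.

To save summaries in the program above, use sv.summary_computed(); to save a checkpoint manually, use sv.saver.save(). Manual saving still works when automatic checkpointing is enabled. If the program is interrupted, the Supervisor automatically reloads the model parameters the next time it runs, so saver.restore() does not need to be called by hand.

In a distributed deployment, a few more points apply to saving summaries:

1) A worker that is not the chief supervisor cannot use sv.summary_computed(); calling it does nothing useful and raises an error.

2) When summaries and checkpoints are saved by hand-written code, that code must be removed from every worker other than the chief supervisor. Letting the Supervisor save at fixed time intervals avoids this, so a single codebase is enough for all workers.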
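Since global_step is shared, the workers split a common step budget between them instead of each running the full number of steps. The following thread-based sketch illustrates that idea only. It is a simplification and not how TensorFlow works internally: there the counter lives on a parameter server and is incremented by the minimize op, whereas here a lock-protected Python counter and the names GlobalStep, worker_loop, and run_workers are invented for the example.

```python
import threading


class GlobalStep:
    """A counter shared by all workers, standing in for the global_step variable."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_if_below(self, limit):
        """Atomically add 1 unless the limit is reached; return True if incremented."""
        with self._lock:
            if self.value >= limit:
                return False
            self.value += 1
            return True


def worker_loop(global_step, total_steps, steps_done, task_index):
    """Each worker keeps training until the shared counter reaches the limit."""
    while global_step.increment_if_below(total_steps):
        steps_done[task_index] += 1  # only this worker writes to its own slot


def run_workers(num_workers, total_steps):
    global_step = GlobalStep()
    steps_done = [0] * num_workers
    threads = [
        threading.Thread(target=worker_loop,
                         args=(global_step, total_steps, steps_done, i))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_step.value, steps_done


final_step, steps_done = run_workers(num_workers=2, total_steps=2000)
print(final_step, steps_done)
```

How the 2000 steps are divided between the two workers varies from run to run, but their sum always equals the shared counter.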

The complete code follows.

To run it, open three terminals and enter one command in each:

1) python distribute.py --job_name=ps --task_index=0
2) python distribute.py --job_name=worker --task_index=0
3) python distribute.py --job_name=worker --task_index=1

import numpy as np
import tensorflow as tf

flags = tf.app.flags

# Role name
flags.DEFINE_string('job_name', None, 'job name: worker or ps')
# Index of the task within its job
flags.DEFINE_integer('task_index', None, 'Index of task within the job')

# Host addresses and ports
flags.DEFINE_string('ps_hosts', 'localhost:1681', 'Comma-separated list of hostname:port pairs')
flags.DEFINE_string('worker_hosts', 'localhost:1682,localhost:1683', 'Comma-separated list of hostname:port pairs')
# Directory for saved files
flags.DEFINE_string('log_dir', 'log/super/', 'directory path')

# Training settings
flags.DEFINE_integer('training_epochs', 20, 'training epochs')

FLAGS = flags.FLAGS

# Generate simulated data
train_X = np.linspace(-1, 1, 100)
train_Y = 2 * train_X + np.random.randn(*train_X.shape) * 0.3  # y = 2x with added noise

tf.reset_default_graph()

ps_hosts = FLAGS.ps_hosts.split(',')
worker_hosts = FLAGS.worker_hosts.split(',')
cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})
# Create the server for this task
server = tf.train.Server(cluster_spec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

# A ps task blocks in join() and waits
if FLAGS.job_name == 'ps':
    print("waiting...")
    server.join()

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster_spec)):
    X = tf.placeholder("float")
    Y = tf.placeholder("float")
    # Model parameters
    W = tf.Variable(tf.random_normal([1]), name="weight")
    b = tf.Variable(tf.zeros([1]), name="bias")

    # Training step counter; tf.train.get_or_create_global_step() replaces the deprecated tf.contrib version
    global_step = tf.train.get_or_create_global_step()

    # Forward pass
    z = tf.multiply(X, W) + b
    # Loss for back-propagation
    cost = tf.reduce_mean(tf.square(Y - z))
    learning_rate = 0.01
    # Gradient descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, global_step=global_step)

    init = tf.global_variables_initializer()


sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         init_op=init,
                         global_step=global_step)

# Connect to the target server and create a session
with sv.managed_session(server.target) as sess:
    step = sess.run(global_step)
    print(step)

    # global_step is shared by all workers; train until it reaches the total step count
    total_steps = FLAGS.training_epochs * len(train_X)
    while step < total_steps:
        for (x, y) in zip(train_X, train_Y):
            _, step = sess.run([optimizer, global_step], feed_dict={X: x, Y: y})

            loss = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print("Step:", step, "cost=", loss, "W=", sess.run(W), "b=", sess.run(b))

    print(" Finished!")
sv.stop()

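For reference, the signature and docstring of tf.train.replica_device_setter() from the TensorFlow source:

def replica_device_setter(ps_tasks=0, ps_device="/job:ps",
                          worker_device="/job:worker", merge_devices=True,
                          cluster=None, ps_ops=None, ps_strategy=None):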
  """Return a `device function` to use when building a Graph for replicas.

  Device Functions are used in `with tf.device(device_function):` statement to
  automatically assign devices to `Operation` objects as they are constructed,
  Device constraints are added from the inner-most context first, working
  outwards. The merging behavior adds constraints to fields that are yet unset
  by a more inner context. Currently the fields are (job, task, cpu/gpu).

  If `cluster` is `None`, and `ps_tasks` is 0, the returned function is a no-op.
  Otherwise, the value of `ps_tasks` is derived from `cluster`.

  By default, only Variable ops are placed on ps tasks, and the placement
  strategy is round-robin over all ps tasks. A custom `ps_strategy` may be used
  to do more intelligent placement, such as
  `tf.contrib.training.GreedyLoadBalancingStrategy`.

  For example,

  ```python
  # To build a cluster with two ps jobs on hosts ps0 and ps1, and 3 worker
  # jobs on hosts worker0, worker1 and worker2.
  cluster_spec = {
      "ps": ["ps0:2222", "ps1:2222"],
      "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}
  with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    # Build your graph
    v1 = tf.Variable(...)  # assigned to /job:ps/task:0
    v2 = tf.Variable(...)  # assigned to /job:ps/task:1
    v3 = tf.Variable(...)  # assigned to /job:ps/task:0
  # Run compute
  ```

  Args:
    ps_tasks: Number of tasks in the `ps` job.  Ignored if `cluster` is
      provided.
    ps_device: String.  Device of the `ps` job.  If empty no `ps` job is used.
      Defaults to `ps`.
    worker_device: String.  Device of the `worker` job.  If empty no `worker`
      job is used.
    merge_devices: `Boolean`. If `True`, merges or only sets a device if the
      device constraint is completely unset. merges device specification rather
      than overriding them.
    cluster: `ClusterDef` proto or `ClusterSpec`.
    ps_ops: List of strings representing `Operation` types that need to be
      placed on `ps` devices.  If `None`, defaults to `STANDARD_PS_OPS`.
    ps_strategy: A callable invoked for every ps `Operation` (i.e. matched by
      `ps_ops`), that takes the `Operation` and returns the ps task index to
      use.  If `None`, defaults to a round-robin strategy across all `ps`
      devices.

  Returns:
    A function to pass to `tf.device()`.

  Raises:
    TypeError if `cluster` is not a dictionary or `ClusterDef` protocol buffer,
    or if `ps_strategy` is provided but not a callable.
  """
