TensorFlow Distributed Parallel Training in Practice

Latest recommended article published 2022-10-19 10:21:53

_well_s


Category: Deep Learning. Tags: deep learning, distributed, framework

Article link: https://blog.csdn.net/u011987514/article/details/71300484


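This article walks through TensorFlow's distributed parallel training in practice.

TensorFlow supports several distribution modes. In-graph replication is described here as model parallelism: different parts of the model's computation graph are placed on different machines and executed there.

Between-graph replication is data parallelism: every machine builds exactly the same computation graph but processes a different batch of data.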
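
As a rough illustration of the data-parallel idea, the sketch below uses plain Python rather than TensorFlow: two simulated workers run the same gradient function on different halves of one batch. The toy model `y = w * x` and the names `gradient` and `split_batch` are assumptions made up for this example; they are not part of TensorFlow.

```python
# Hypothetical toy example: data parallelism for the 1-D linear model y = w * x.
# Every "worker" runs the same computation on a different shard of the batch.

def gradient(w, shard):
    """Mean gradient of the squared error 0.5 * (w*x - y)**2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def split_batch(batch, num_workers):
    """Give each worker a contiguous, equally sized slice of the batch."""
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0

shards = split_batch(batch, num_workers=2)
worker_grads = [gradient(w, shard) for shard in shards]  # same function, different data
averaged = sum(worker_grads) / len(worker_grads)
```

Because the shards have equal size, averaging the per-worker gradients gives the same value as computing the gradient over the whole batch on one machine.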

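Training can also be asynchronous or synchronous. In asynchronous mode, each machine computes its gradients independently and pushes them to the parameter server as soon as they are ready, without waiting for the other machines.

In synchronous mode, the update waits until every machine has finished computing its gradients; the gradients are then aggregated and applied to the model parameters in a single update.

Synchronous training usually drives the loss down faster and tends to reach higher final accuracy.

Its drawback is the straggler (weakest-link) effect: the cluster runs only as fast as its slowest machine, so synchronous training is most efficient when all devices run at similar speeds.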
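
The difference between the two update schemes, and the straggler effect, can be sketched in a few lines of plain Python. This is a simplified simulation under stated assumptions, not how TensorFlow implements either mode: the gradients are fixed numbers, the asynchronous arrivals are replayed in list order, and all function names are hypothetical.

```python
def sync_update(w, grads, lr):
    """Synchronous: wait for all gradients, average them, apply one update."""
    return w - lr * sum(grads) / len(grads)


def async_update(w, grads, lr):
    """Asynchronous: apply each gradient as it arrives (replayed in list order).

    The gradients are assumed to have been computed from the same starting w,
    so the later ones are already stale when they are applied.
    """
    for g in grads:
        w = w - lr * g
    return w


def sync_round_time(worker_times):
    """A synchronous round lasts as long as the slowest worker."""
    return max(worker_times)


def async_updates_within(duration, worker_times):
    """Number of updates applied in `duration` seconds when nobody waits."""
    return sum(int(duration // t) for t in worker_times)


grads = [-5.0, -25.0]
w_sync = sync_update(0.0, grads, lr=0.01)
w_async = async_update(0.0, grads, lr=0.01)

worker_times = [1.0, 1.0, 3.0]  # seconds per batch; the third worker is the straggler
round_time = sync_round_time(worker_times)
updates = async_updates_within(3.0, worker_times)
```

With these numbers a synchronous round takes 3.0 seconds because of the slow worker, while in the same 3 seconds the asynchronous scheme applies 7 separate updates, at the cost of using stale gradients.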

The program below uses TensorFlow to implement distributed parallel training with one parameter server and one worker.

# -*- coding: utf-8 -*-
import math
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

flags = tf.app.flags
flags.DEFINE_string("data_dir", "/tmp/mnist_data", "Directory for storing data")
flags.DEFINE_integer("hidden_units", 100, "Number of units in the hidden layer")
flags.DEFINE_integer("train_steps", 1000, "Number of training steps")
flags.DEFINE_integer("batch_size", 100, "Batch size")
flags.DEFINE_float("learning_rate", 0.01, "Learning rate")
flags.DEFINE_boolean("sync_replicas", False, "Use the sync_replicas mode")
flags.DEFINE_integer("replicas_to_aggregate", None, "Number of replicas to aggregate before parameter "
                                                    "update is applied, default: num_workers")
# None means the number of workers, i.e. the model parameters are updated
# only after every worker has finished training on one batch.
flags.DEFINE_string("ps_hosts", "192.168.0.107:2222", "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("worker_hosts", "10.211.55.14:2222", "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("job_name", None, "Job name: worker or ps")
flags.DEFINE_integer("task_index", None, "Worker task index, should be >= 0. task=0 is "
                                         "the master worker task that performs the variable initialization")

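To make the role of these flags concrete, here is a minimal sketch in plain Python of how their values might be interpreted. The helper names `build_cluster`, `resolve_replicas_to_aggregate`, and `is_chief` are hypothetical and are not taken from the original program. In TensorFlow 1.x, a dict of this shape is the kind of input `tf.train.ClusterSpec` accepts, but no TensorFlow call is made here.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated host:port strings into a job -> address-list dict."""
    return {
        "ps": [h.strip() for h in ps_hosts.split(",") if h.strip()],
        "worker": [h.strip() for h in worker_hosts.split(",") if h.strip()],
    }


def resolve_replicas_to_aggregate(value, num_workers):
    """None means: aggregate one gradient from every worker before updating."""
    return num_workers if value is None else value


def is_chief(job_name, task_index):
    """Assumption: worker 0 is the master task that initializes the variables."""
    return job_name == "worker" and task_index == 0


# Default values of the ps_hosts and worker_hosts flags defined above.
cluster = build_cluster("192.168.0.107:2222", "10.211.55.14:2222")
num_workers = len(cluster["worker"])
replicas = resolve_replicas_to_aggregate(None, num_workers)
```

With the default flag values this yields one parameter server and one worker, so `replicas_to_aggregate` resolves to 1 and the single worker (task index 0) acts as the master.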