What is distributed TensorFlow training, and what does it actually do?
It splits one graph into N parts and hands each part to one of N servers to compute (in-graph replication), or gives each of N servers its own full copy of the graph (between-graph replication); hybrids of the two are also possible.
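As a rough analogy of the two modes (plain Python, not the TensorFlow API — a "server" here is just a function and the "graph" is a list of operations):

```python
# A toy analogy for the two ways of distributing work across N "servers".

def run_ops(ops, x):
    # execute a chain of operations on input x
    for op in ops:
        x = op(x)
    return x

graph = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# In-graph: ONE graph, split into parts; each server computes one part.
part1, part2 = graph[:2], graph[2:]
x = run_ops(part1, 10)               # server A computes the first part
result_in_graph = run_ops(part2, x)  # server B finishes the rest

# Between-graph: each server gets its OWN full copy of the graph
# and runs it on its own shard of the data.
data_shards = [10, 20]
results_between = [run_ops(graph, shard) for shard in data_shards]

print(result_in_graph)   # 19
print(results_between)   # [19, 39]
```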
Cluster
These N servers are the cluster.
Master and workers
Among these N servers, one is the master (it can also moonlight as a worker); the other N-1 are workers. The master's main job is to execute the graph: it assigns tasks to the workers according to the graph, coordinates the communication between them, and collects the results.
Client
The master is the cluster's public face: it is the one that talks to the client.
So what is a client? It is the customer buying the service — in plain terms, the machine where you write the code and run the main program. Its job is to build the graph (which includes the task placement), send the finished graph to the master, and receive the results once the master is done.
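The division of labor can be sketched in a few lines of plain Python (a toy stand-in, not the real gRPC protocol): the client only *describes* the computation; the master executes it and sends the result back.

```python
# A toy sketch of the client/master handshake.

def master_run(graph, feed):
    # the "master": receives a graph description and executes it
    return {name: fn(feed) for name, fn in graph.items()}

# the "client": builds the graph but never executes it itself
client_graph = {"double": lambda x: 2 * x, "square": lambda x: x * x}
result = master_run(client_graph, 7)   # "send" the graph to the master
print(result)   # {'double': 14, 'square': 49}
```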
The simplest model, Client—Master(worker), in code:
Master(worker): boss and employee in one, a one-person shop. A local server needs no task assignments; it simply executes whatever comes in:
import tensorflow as tf

# start an in-process, single-machine server
server = tf.train.Server.create_local_server()
server.join()  # block forever, serving requests
Once it is running, the server is identified by its gRPC target (also readable from server.target), e.g. grpc://172.16.100.2:12222
Client: build the graph, then send it to the master for execution:
Build the graph
import tensorflow as tf
import numpy as np
# create data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data*0.1 + 0.3
### create tensorflow structure start ###
Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
biases = tf.Variable(tf.zeros([1]))
y = Weights*x_data + biases
loss = tf.reduce_mean(tf.square(y-y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
### create tensorflow structure end ###
Send the graph to the master for execution
server_target = "grpc://172.16.100.2:12222"  # the master's target
with tf.Session(server_target) as sess:
    sess.run(init)
    for step in range(1000):
        sess.run(train)
        if step % 20 == 0:
            print(step, sess.run(Weights), sess.run(biases))
The general model, Client—Master(worker)—Workers, in code:
Master/workers: because the task placement is already baked into the graph, the master and the workers run identical code (the master implicitly executes sess.run):
tf.train.Server(cluster, job_name="canshu", task_index=0) describes the task the current host should run, i.e. the work this host accepts:
Each worker is uniquely identified by /job:job_id/task:task_id.
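How such a /job:&lt;job&gt;/task:&lt;index&gt; name maps onto a host is just a dictionary lookup; here is a plain-Python sketch of the resolution that tf.train.ClusterSpec performs internally (the dict mirrors the cluster used in the examples below):

```python
# Resolve a TensorFlow device name to the host that runs it.

cluster = {
    "canshu":  ["172.16.100.2:12222", "172.16.100.3:12222"],
    "gongzuo": ["172.16.100.4:12222", "172.16.100.5:12222"],
}

def device_to_host(cluster, device):
    # parse e.g. "/job:canshu/task:1" and return the matching host
    _, job_part, task_part = device.split("/")
    job = job_part.split(":")[1]
    task = int(task_part.split(":")[1])
    return cluster[job][task]

print(device_to_host(cluster, "/job:canshu/task:1"))   # 172.16.100.3:12222
print(device_to_host(cluster, "/job:gongzuo/task:0"))  # 172.16.100.4:12222
```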
# Host 172.16.100.2:12222: job "canshu", task 0
import tensorflow as tf
cluster = tf.train.ClusterSpec({
    "canshu": [
        "172.16.100.2:12222",  # host running /job:canshu/task:0
        "172.16.100.3:12222",  # host running /job:canshu/task:1
    ],
    "gongzuo": [
        "172.16.100.4:12222",  # host running /job:gongzuo/task:0
        "172.16.100.5:12222"   # host running /job:gongzuo/task:1
    ]})
server = tf.train.Server(cluster, job_name="canshu", task_index=0)
server.join()
# Host 172.16.100.3:12222: job "canshu", task 1
import tensorflow as tf
cluster = tf.train.ClusterSpec({
    "canshu": [
        "172.16.100.2:12222",  # host running /job:canshu/task:0
        "172.16.100.3:12222",  # host running /job:canshu/task:1
    ],
    "gongzuo": [
        "172.16.100.4:12222",  # host running /job:gongzuo/task:0
        "172.16.100.5:12222"   # host running /job:gongzuo/task:1
    ]})
server = tf.train.Server(cluster, job_name="canshu", task_index=1)
server.join()
# Host 172.16.100.4:12222: job "gongzuo", task 0
import tensorflow as tf
cluster = tf.train.ClusterSpec({
    "canshu": [
        "172.16.100.2:12222",  # host running /job:canshu/task:0
        "172.16.100.3:12222",  # host running /job:canshu/task:1
    ],
    "gongzuo": [
        "172.16.100.4:12222",  # host running /job:gongzuo/task:0
        "172.16.100.5:12222"   # host running /job:gongzuo/task:1
    ]})
server = tf.train.Server(cluster, job_name="gongzuo", task_index=0)
server.join()
# Host 172.16.100.5:12222: job "gongzuo", task 1
import tensorflow as tf
cluster = tf.train.ClusterSpec({
    "canshu": [
        "172.16.100.2:12222",  # host running /job:canshu/task:0
        "172.16.100.3:12222",  # host running /job:canshu/task:1
    ],
    "gongzuo": [
        "172.16.100.4:12222",  # host running /job:gongzuo/task:0
        "172.16.100.5:12222"   # host running /job:gongzuo/task:1
    ]})
server = tf.train.Server(cluster, job_name="gongzuo", task_index=1)
server.join()
Client: build the graph, then send it to the master for execution:
Build the graph
import tensorflow as tf
import numpy as np
# create data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data*0.1 + 0.3
### create tensorflow structure start ###
# pin each piece of the graph to a specific task in the cluster
with tf.device("/job:canshu/task:0"):
    Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
with tf.device("/job:canshu/task:1"):
    biases = tf.Variable(tf.zeros([1]))
with tf.device("/job:gongzuo/task:0"):
    y = Weights*x_data + biases
    loss = tf.reduce_mean(tf.square(y-y_data))
with tf.device("/job:gongzuo/task:1"):
    optimizer = tf.train.GradientDescentOptimizer(0.5)
    train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
### create tensorflow structure end ###
Send the graph to the master:
server_target = "grpc://172.16.100.2:12222"  # the master's target
with tf.Session(server_target) as sess:
    sess.run(init)
    for step in range(1000):
        sess.run(train)
        if step % 20 == 0:
            print(step, sess.run(Weights), sess.run(biases))
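Putting the Variables on the "canshu" tasks and the math on the "gongzuo" tasks is the classic parameter-server pattern: parameter servers hold the state, workers pull it, compute gradients, and push updates back. A plain-Python toy of one such loop (an analogy, not the TensorFlow runtime; the learning rate and data are made up for illustration):

```python
# Parameter-server pattern in miniature: fit y = w*x + b to one point.

# the parameter servers ("canshu") hold the state
params = {"Weights": 0.0, "biases": 0.0}

def worker_step(params, x, y_true, lr=0.1):
    # a worker ("gongzuo") pulls params, computes gradients of the
    # squared error, and returns the updated params to push back
    err = params["Weights"] * x + params["biases"] - y_true
    grad_w, grad_b = 2 * err * x, 2 * err
    return {"Weights": params["Weights"] - lr * grad_w,
            "biases":  params["biases"]  - lr * grad_b}

for _ in range(100):
    params = worker_step(params, x=1.0, y_true=0.4)

print(params)  # Weights + biases has converged to about 0.4
```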