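- Setting up a distributed test platform
- Environment: two servers on the internal network, at 10.0.1.110 and 10.0.1.111, each with an NVIDIA GPU with 12 GB of memory. The server at 10.0.1.110 holds t1.py and c1.py; the server at 10.0.1.111 holds c2.py.
- Contents of each file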
Files on the server at 10.0.1.110
t1.py:
import tensorflow as tf
import numpy as np

# Model: Y = w * X + b, fitted one sample at a time (TensorFlow 1.x graph API).
X = tf.placeholder("float")
Y = tf.placeholder("float")
w = tf.Variable(0.0, name="weight")
b = tf.Variable(0.0, name="reminder")
loss = tf.square(Y - tf.multiply(X, w) - b)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Synthetic data: a line with slope 2 and intercept 10, plus Gaussian noise.
train_X = np.linspace(-1, 1, 101)
train_Y = 2 * train_X + np.random.randn(*train_X.shape) * 0.33 + 10

# tf.initialize_all_variables() is deprecated; use the current initializer.
init_op = tf.global_variables_initializer()

# Connect to the worker server listening on 10.0.1.110:2222.
with tf.Session("grpc://10.0.1.110:2222") as sess:
    sess.run(init_op)
    for i in range(10):
        for (x, y) in zip(train_X, train_Y):
            sess.run(train_op, feed_dict={X: x, Y: y})
    # After training, w should be close to 2 and b close to 10.
    print(sess.run(w))
    print(sess.run(b))
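As a sanity check on what t1.py should print, the same per-sample gradient descent can be reproduced locally without a cluster. The following is a minimal sketch, not part of the original setup: it uses only the Python standard library instead of TensorFlow and NumPy, and the function name `fit_line` and the fixed random seed are assumptions added for illustration. With a learning rate of 0.01 and 10 passes over 101 points, the fitted values should land near the slope 2 and intercept 10 used to generate the data.

```python
import random


def fit_line(xs, ys, learning_rate=0.01, epochs=10):
    """Per-sample gradient descent on (y - w*x - b)**2, mirroring t1.py."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - w * x - b
            # d(loss)/dw = -2 * err * x and d(loss)/db = -2 * err,
            # so stepping against the gradient adds these terms.
            w += learning_rate * 2.0 * err * x
            b += learning_rate * 2.0 * err
    return w, b


rng = random.Random(0)  # fixed seed is an assumption, for repeatability
train_X = [-1.0 + 2.0 * i / 100 for i in range(101)]  # same grid as np.linspace(-1, 1, 101)
train_Y = [2 * x + rng.gauss(0.0, 0.33) + 10 for x in train_X]

w, b = fit_line(train_X, train_Y)
print(w)
print(b)
```

If the distributed run prints values far from 2 and 10, the problem is more likely in the cluster setup than in the model itself.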
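c1.py:
import tensorflow as tf
# Addresses of all workers. A server's task_index is its position in this list.
worker_01 = "10.0.1.110:2222"
worker_02 = "10.0.1.111:2222"
worker_hosts = [worker_01, worker_02]
cluster_spec = tf.train.ClusterSpec({"worker": worker_hosts})
# This host is 10.0.1.110, the first entry in worker_hosts, so task_index is 0.
server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)
# Block and serve requests until the process is killed.
server.join()
Files on the server at 10.0.1.111
c2.py:
import tensorflow as tf
# Same cluster definition as in c1.py; it must be identical on every host.
worker_01 = "10.0.1.110:2222"
worker_02 = "10.0.1.111:2222"
worker_hosts = [worker_01, worker_02]
cluster_spec = tf.train.ClusterSpec({"worker": worker_hosts})
# This host is 10.0.1.111, the second entry in worker_hosts, so task_index is 1.
server = tf.train.Server(cluster_spec, job_name="worker", task_index=1)
# Block and serve requests until the process is killed.
server.join()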
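In `tf.train.Server`, `task_index` selects an entry of the job's host list by position, so each server should pass the index at which its own address appears in `worker_hosts`. The sketch below shows one way to derive that index rather than hard-coding it. The helper name `task_index_for` is hypothetical and not part of TensorFlow.

```python
def task_index_for(host, worker_hosts):
    """Return the position of `host` in `worker_hosts` (hypothetical helper)."""
    if host not in worker_hosts:
        raise ValueError("%s is not listed in worker_hosts: %r" % (host, worker_hosts))
    return worker_hosts.index(host)


worker_hosts = ["10.0.1.110:2222", "10.0.1.111:2222"]
print(task_index_for("10.0.1.110:2222", worker_hosts))  # 0
print(task_index_for("10.0.1.111:2222", worker_hosts))  # 1
```

The returned value could then be passed as the `task_index` argument, which keeps the index and the address list from drifting apart when hosts are added or reordered.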
Note: start c1.py and c2.py first so that both worker servers are up and waiting, then start the main task, t1.py.
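The start order can be checked from the machine that will run t1.py by testing whether each worker port accepts a TCP connection. This is a rough sketch using only the standard library. The function name `wait_for_workers` and its `timeout` and `interval` parameters are assumptions made for illustration, and an open port only shows that a process is listening, not that the TensorFlow server is fully initialized.

```python
import socket
import time


def wait_for_workers(hosts, timeout=60.0, interval=1.0):
    """Block until every "host:port" accepts a TCP connection, or raise RuntimeError."""
    deadline = time.time() + timeout
    pending = list(hosts)
    while pending:
        host, port = pending[0].rsplit(":", 1)
        try:
            # A successful connect means something is listening on that port.
            socket.create_connection((host, int(port)), timeout=interval).close()
            pending.pop(0)
        except OSError:
            if time.time() >= deadline:
                raise RuntimeError("workers not reachable: %s" % ", ".join(pending))
            time.sleep(interval)
```

For the setup described here, calling `wait_for_workers(["10.0.1.110:2222", "10.0.1.111:2222"])` before launching t1.py would return once both c1.py and c2.py are listening, and raise if either is still down after the timeout.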