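Implementing TensorFlow Distributed Training in Docker Containers
1. Introduction
Overview of distributed TensorFlow: distributed training mainly takes one of three forms: a single machine with multiple GPUs, multiple machines with one GPU each, or multiple machines with multiple GPUs. All of these are built on the parameter server (PS) architecture, and parameter updates are communicated either synchronously (synchronous SGD) or asynchronously (asynchronous SGD).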
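The difference between the two update modes can be illustrated with a toy example. The following is only a minimal sketch in plain Python, not TensorFlow code: the function names are hypothetical, the parameter is a single float, and the gradients are fixed numbers. In real asynchronous training each worker's gradient may also have been computed from stale parameters, which this sketch does not model.

```python
def sync_sgd_step(w, worker_grads, lr):
    """Synchronous SGD: wait for every worker, average the gradients,
    then apply a single update to the shared parameter."""
    avg_grad = sum(worker_grads) / len(worker_grads)
    return w - lr * avg_grad


def async_sgd_steps(w, worker_grads, lr):
    """Asynchronous SGD: apply each worker's gradient as soon as it
    arrives, so the parameter is updated once per worker."""
    for grad in worker_grads:
        w = w - lr * grad
    return w


# Two workers report gradients 0.5 and 1.5 for a parameter that starts at 1.0.
w_sync = sync_sgd_step(1.0, [0.5, 1.5], lr=0.1)     # one averaged update
w_async = async_sgd_steps(1.0, [0.5, 1.5], lr=0.1)  # two separate updates
print(w_sync, w_async)
```

With the same gradients, the synchronous mode moves the parameter once by the averaged gradient, while the asynchronous mode moves it once per worker.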
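- Environment
  - Windows 10 + virtual machine + CentOS 7 + Docker + TensorFlow
- Contents
  - This article uses Docker containers to simulate the multi-machine, multi-GPU case. Each container stands in for one machine and runs the CPU build of TensorFlow.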
2. Distributed training in Docker
- Building the TensorFlow image
  - Write the Dockerfile
    vim tensorflow-cpu
    and put the following into it:
    FROM centos:7.3.1611
    MAINTAINER urmsone
    RUN yum install -y vim net-tools
    RUN yum install -y epel-release && yum install -y gcc python-devel python2-pip && pip install --upgrade pip
    # && pip install jupyter && python -m ipykernel.kernelspec
    RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ https://mirrors.tuna.tsinghua.edu.cn/tensorflow/linux/cpu/tensorflow-1.3.0-cp27-none-linux_x86_64.whl
    # Copy the distributed training script to the container's root directory
    COPY mnist_replica.py /
    # Copy the MNIST dataset from the data directory into /tmp/mnist-data/ in the container
    COPY data/* /tmp/mnist-data/
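  - Fetch the distributed training script
    curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/dist_test/python/mnist_replica.py -o mnist_replica.py
  - Prepare the MNIST dataset: put the dataset files in a data directory next to the Dockerfile, because the Dockerfile copies data/* into /tmp/mnist-data/
  - Build the image
    docker build -t tensorflow:1 -f tensorflow-cpu .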
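For the "prepare the MNIST dataset" step, one option is to fetch the four archive files into the data directory before running docker build. The sketch below is an assumption rather than part of the original workflow: the helper names are hypothetical, and BASE_URL is a mirror that may not be reachable from every network, so replace it with any mirror you can access. The four file names are the ones input_data.read_data_sets() looks for.

```python
import os

try:  # Python 3
    from urllib.request import urlretrieve
except ImportError:  # Python 2.7, the interpreter installed in the image
    from urllib import urlretrieve

# Assumption: a publicly reachable mirror; swap in another one if needed.
BASE_URL = "https://storage.googleapis.com/cvdf-datasets/mnist/"

# File names that input_data.read_data_sets() expects in data_dir.
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def missing_files(data_dir):
    """Return the MNIST archives that are not yet present in data_dir."""
    return [name for name in MNIST_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


def download_missing(data_dir, base_url=BASE_URL):
    """Download only the archives that are missing from data_dir."""
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
    for name in missing_files(data_dir):
        urlretrieve(base_url + name, os.path.join(data_dir, name))
```

Calling download_missing("data") in the directory that holds the Dockerfile, and then checking that missing_files("data") returns an empty list, should leave the files where COPY data/* expects them.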
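- Starting the containers
  - Start the ps container, which acts as the parameter server
    docker run -it --name ps -p 2222 --rm tensorflow:1 /bin/bash
  - Start the worker1 container, which acts as compute server 1
    docker run -it --name worker1 -p 2222 --rm tensorflow:1 /bin/bash
  - Start the worker2 container, which acts as compute server 2
    docker run -it --name worker2 -p 2222 --rm tensorflow:1 /bin/bash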
- Building the cluster inside the containers and starting distributed training
  - In the ps container, run
    python mnist_replica.py --ps_hosts=172.17.0.2:2222 --worker_hosts=172.17.0.5:2222,172.17.0.3:2222 --job_name=ps --task_index=0
  - In the worker1 container, run
    python mnist_replica.py --ps_hosts=172.17.0.2:2222 --worker_hosts=172.17.0.5:2222,172.17.0.3:2222 --job_name=worker --task_index=0
  - In the worker2 container, run
    python mnist_replica.py --ps_hosts=172.17.0.2:2222 --worker_hosts=172.17.0.5:2222,172.17.0.3:2222 --job_name=worker --task_index=1
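Note: in the python commands, --ps_hosts is the address of the ps container, and --worker_hosts lists the addresses of worker1 and worker2 in task order (the first entry is task_index 0). To look up a container's IP address, run
    docker inspect ps | grep Addr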
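Because the three commands differ only in --job_name and --task_index, they can be generated from the container addresses. The following is a rough sketch with hypothetical helper names. It assumes the containers are on Docker's default bridge network, where the JSON printed by docker inspect carries the address under NetworkSettings.IPAddress; on a user-defined network the address is stored elsewhere.

```python
import json


def container_ip(inspect_json):
    """Extract the container IP from the JSON text printed by
    `docker inspect <name>` (a JSON array with one object).
    Assumes the default bridge network."""
    info = json.loads(inspect_json)[0]
    return info["NetworkSettings"]["IPAddress"]


def replica_commands(ps_ips, worker_ips, port=2222):
    """Build one mnist_replica.py command per task, ps tasks first.
    The position of an IP in its list becomes its --task_index."""
    ps_hosts = ",".join("%s:%d" % (ip, port) for ip in ps_ips)
    worker_hosts = ",".join("%s:%d" % (ip, port) for ip in worker_ips)
    base = ("python mnist_replica.py --ps_hosts=%s --worker_hosts=%s"
            % (ps_hosts, worker_hosts))
    commands = []
    for job_name, ips in (("ps", ps_ips), ("worker", worker_ips)):
        for task_index in range(len(ips)):
            commands.append("%s --job_name=%s --task_index=%d"
                            % (base, job_name, task_index))
    return commands


# Addresses taken from the commands in this article.
for cmd in replica_commands(["172.17.0.2"], ["172.17.0.5", "172.17.0.3"]):
    print(cmd)
```

Every task receives the same --ps_hosts and --worker_hosts values, which is what lets all processes agree on the cluster layout.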
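- Error summary
  - tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server
    Cause: a previous run of python mnist_replica.py was never terminated, so starting the script again fails with this unknown error.
    Fix: terminate the process that is still running, then run the command again.
  - TensorFlow IOError: [Errno socket error] [Errno 104] Connection reset by peer
    Cause: when input_data.read_data_sets() reads the MNIST dataset and the files do not exist, it downloads them automatically. If the remote URL is blocked on your network and can only be reached through a proxy, the download fails with the network error above.
    Fix: switch to a URL that is reachable without a proxy, or download the MNIST dataset manually.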
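For the first error (Could not start gRPC server), a quick way to confirm that an earlier process is still holding the gRPC port is to try binding to it. This is a minimal sketch with a hypothetical helper name; it only reports whether a TCP port can be bound on the given host and does not identify which process owns it.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port),
    False if the bind fails, for example because the port is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        sock.close()
```

Inside a container, port_is_free(2222) returning False before the script is launched suggests that a previous mnist_replica.py process is still running and should be stopped first.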
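Appendix
Code from the official TensorFlow GitHub repository. Note that this listing is an older revision of mnist_replica.py: it defines the --worker_index, --num_workers and --worker_grpc_url flags rather than the --ps_hosts, --worker_hosts, --job_name and --task_index flags used in the commands above.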
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Distributed MNIST training and validation, with model replicas.

A simple softmax model with one hidden layer is defined. The parameters
(weights and biases) are located on two parameter servers (ps), while the
ops are defined on a worker node. The TF sessions also run on the worker
node.

Multiple invocations of this script can be done in parallel, with different
values for --worker_index. There should be exactly one invocation with
--worker_index=0, which will create a master session that carries out variable
initialization. The other, non-master, sessions will wait for the master
session to finish the initialization before proceeding to the training stage.

The coordination between the multiple worker invocations occurs due to
the definition of the parameters on the same ps devices. The parameter updates
from one worker are visible to all other workers. As such, the workers can
perform forward computation and gradient calculation in parallel, which
should lead to increased training speed for the simple model.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import sys
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
flags = tf.app.flags
flags.DEFINE_string("data_dir", "/tmp/mnist-data",
                    "Directory for storing mnist data")
flags.DEFINE_boolean("download_only", False,
                     "Only perform downloading of data; Do not proceed to "
                     "session preparation, model definition or training")
flags.DEFINE_integer("worker_index", 0,
                     "Worker task index, should be >= 0. worker_index=0 is "
                     "the master worker task that performs the variable "
                     "initialization")
flags.DEFINE_integer("num_workers", None,
                     "Total number of workers (must be >= 1)")
flags.DEFINE_integer("num_parameter_servers", 2,
                     "Total number of parameter servers (must be >= 1)")
flags.DEFINE_integer("replicas_to_aggregate", None,
                     "Number of replicas to aggregate before parameter update "
                     "is applied (For sync_replicas mode only; default: "
                     "num_workers)")
flags.DEFINE_integer("grpc_port", 2222,
                     "TensorFlow GRPC port")
flags.DEFINE_integer("hidden_units", 100,
                     "Number of units in the hidden layer of the NN")
flags.DEFINE_integer("train_steps", 200,
                     "Number of (global) training steps to perform")
flags.DEFINE_integer("batch_size", 100, "Training batch size")
flags.DEFINE_float("learning_rate", 0.01, "Learning rate")
flags.DEFINE_string("worker_grpc_url", None,
                    "Worker GRPC URL (e.g., grpc://1.2.3.4:2222, or "
                    "grpc://tf-worker0:2222)")
flags.DEFINE_boolean("sync_replicas", False,
                     "Use the sync_replicas (synchronized replicas) mode, "
                     "wherein the parameter updates from workers are "
                     "aggregated before applied to avoid stale gradients")

FLAGS = flags.FLAGS

IMAGE_PIXELS = 28

PARAM_SERVER_PREFIX = "tf-ps"  # Prefix of the parameter servers' domain names
WORKER_PREFIX = "tf-worker"  # Prefix of the workers' domain names

def get_device_setter(num_parameter_servers, num_workers):
  """Get a device setter given number of servers in the cluster.

  Given the numbers of parameter servers and workers, construct a device
  setter object using ClusterSpec.

  Args:
    num_parameter_servers: Number of parameter servers
    num_workers: Number of workers

  Returns:
    Device setter object.
  """
  ps_spec = []
  for j in range(num_parameter_servers):
    ps_spec.append("%s%d:%d" % (PARAM_SERVER_PREFIX, j, FLAGS.grpc_port))

  worker_spec = []
  for k in range(num_workers):
    worker_spec.append("%s%d:%d" % (WORKER_PREFIX, k, FLAGS.grpc_port))

  cluster_spec = tf.train.ClusterSpec({
      "ps": ps_spec,
      "worker": worker_spec})

  # Get device setter from the cluster spec
  return tf.train.replica_device_setter(cluster=cluster_spec)

def main(unused_argv):
  mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
  if FLAGS.download_only:
    sys.exit(0)

  print("Worker GRPC URL: %s" % FLAGS.worker_grpc_url)
  print("Worker index = %d" % FLAGS.worker_index)
  print("Number of workers = %d" % FLAGS.num_workers)

  # Sanity check on the number of workers and the worker index
  if FLAGS.worker_index >= FLAGS.num_workers:
    raise ValueError("Worker index %d exceeds number of workers %d " %
                     (FLAGS.worker_index, FLAGS.num_workers))

  # Sanity check on the number of parameter servers
  if FLAGS.num_parameter_servers <= 0:
    raise ValueError("Invalid num_parameter_servers value: %d" %
                     FLAGS.num_parameter_servers)

  is_chief = (FLAGS.worker_index == 0)

  if FLAGS.sync_replicas:
    if FLAGS.replicas_to_aggregate is None:
      replicas_to_aggregate = FLAGS.num_workers
    else:
      replicas_to_aggregate = FLAGS.replicas_to_aggregate

  # Construct device setter object
  device_setter = get_device_setter(FLAGS.num_parameter_servers,
                                    FLAGS.num_workers)

  # The device setter will automatically place Variables ops on separate
  # parameter servers (ps). The non-Variable ops will be placed on the workers.
  with tf.device(device_setter):
    global_step = tf.Variable(0, name="global_step", trainable=False)

    # Variables of the hidden layer
    hid_w = tf.Variable(
        tf.truncated_normal([IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
                            stddev=1.0 / IMAGE_PIXELS), name="hid_w")
    hid_b = tf.Variable(tf.zeros([FLAGS.hidden_units]), name="hid_b")

    # Variables of the softmax layer
    sm_w = tf.Variable(
        tf.truncated_normal([FLAGS.hidden_units, 10],
                            stddev=1.0 / math.sqrt(FLAGS.hidden_units)),
        name="sm_w")
    sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

    # Ops: located on the worker specified with FLAGS.worker_index
    x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
    y_ = tf.placeholder(tf.float32, [None, 10])

    hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
    hid = tf.nn.relu(hid_lin)

    y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
    cross_entropy = -tf.reduce_sum(y_ *
                                   tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

    opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
    if FLAGS.sync_replicas:
      opt = tf.train.SyncReplicasOptimizer(
          opt,
          replicas_to_aggregate=replicas_to_aggregate,
          total_num_replicas=FLAGS.num_workers,
          replica_id=FLAGS.worker_index,
          name="mnist_sync_replicas")

    train_step = opt.minimize(cross_entropy,
                              global_step=global_step)

    if FLAGS.sync_replicas and is_chief:
      # Initial token and chief queue runners required by the sync_replicas mode
      chief_queue_runner = opt.get_chief_queue_runner()
      init_tokens_op = opt.get_init_tokens_op()

    init_op = tf.initialize_all_variables()
    train_dir = tempfile.mkdtemp()
    sv = tf.train.Supervisor(is_chief=is_chief,
                             logdir=train_dir,
                             init_op=init_op,
                             recovery_wait_secs=1,
                             global_step=global_step)

    sess_config = tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=True,
        device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.worker_index])

    # The chief worker (worker_index==0) session will prepare the session,
    # while the remaining workers will wait for the preparation to complete.
    if is_chief:
      print("Worker %d: Initializing session..." % FLAGS.worker_index)
    else:
      print("Worker %d: Waiting for session to be initialized..." %
            FLAGS.worker_index)

    sess = sv.prepare_or_wait_for_session(FLAGS.worker_grpc_url,
                                          config=sess_config)

    print("Worker %d: Session initialization complete." % FLAGS.worker_index)

    if FLAGS.sync_replicas and is_chief:
      # Chief worker will start the chief queue runner and call the init op
      print("Starting chief queue runner and running init_tokens_op")
      sv.start_queue_runners(sess, [chief_queue_runner])
      sess.run(init_tokens_op)

    # Perform training
    time_begin = time.time()
    print("Training begins @ %f" % time_begin)

    local_step = 0
    while True:
      # Training feed
      batch_xs, batch_ys = mnist.train.next_batch(FLAGS.batch_size)
      train_feed = {x: batch_xs,
                    y_: batch_ys}

      _, step = sess.run([train_step, global_step], feed_dict=train_feed)
      local_step += 1

      now = time.time()
      print("%f: Worker %d: training step %d done (global step: %d)" %
            (now, FLAGS.worker_index, local_step, step))

      if step >= FLAGS.train_steps:
        break

    time_end = time.time()
    print("Training ends @ %f" % time_end)
    training_time = time_end - time_begin
    print("Training elapsed time: %f s" % training_time)

    # Validation feed
    val_feed = {x: mnist.validation.images,
                y_: mnist.validation.labels}
    val_xent = sess.run(cross_entropy, feed_dict=val_feed)
    print("After %d training step(s), validation cross entropy = %g" %
          (FLAGS.train_steps, val_xent))


if __name__ == "__main__":
  tf.app.run()
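To make the role of get_device_setter in the listing easier to picture, here is a small plain-Python sketch that involves no TensorFlow. The helper names are hypothetical. host_specs mirrors how the listing builds host strings from a prefix, an index and the gRPC port. round_robin_placement is only a simplified model, under the assumption that variables are assigned to parameter servers in creation order, of the round-robin behaviour that tf.train.replica_device_setter uses by default; the real device strings and op handling are more detailed.

```python
def host_specs(prefix, count, port=2222):
    """Build host:port strings the way the listing does,
    e.g. prefix "tf-ps" and count 2 give tf-ps0 and tf-ps1."""
    return ["%s%d:%d" % (prefix, i, port) for i in range(count)]


def round_robin_placement(variable_names, num_ps):
    """Simplified model: the i-th variable created goes to
    parameter server task i modulo num_ps."""
    return {name: "/job:ps/task:%d" % (i % num_ps)
            for i, name in enumerate(variable_names)}


# Variables in the order the listing creates them, with its default of 2 ps tasks.
variables = ["global_step", "hid_w", "hid_b", "sm_w", "sm_b"]
print(host_specs("tf-ps", 2))
print(host_specs("tf-worker", 2))
print(round_robin_placement(variables, num_ps=2))
```

Under this model the five variables alternate between the two parameter server tasks, while the remaining ops stay on the worker that built the graph.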