TensorFlow: Distributed Training


Never-guess



Article link: https://blog.csdn.net/qq_20791919/article/details/78923439


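This article shows how to run multi-node, multi-GPU distributed training with TensorFlow on a Tesla K20c cluster. The experiment uses one CPU node as the parameter server and two GPU nodes, each with 2 GPUs, as workers, and covers both synchronous and asynchronous distributed training.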


Task:

Distributed training across multiple nodes and multiple GPUs on a cluster

Launch one process per task (an empty CUDA_VISIBLE_DEVICES hides all GPUs, so the parameter server runs on the CPU):

CUDA_VISIBLE_DEVICES='' python distributed.py --job_name=ps --task_index=0

CUDA_VISIBLE_DEVICES='0' python distributed.py --job_name=worker --task_index=0

CUDA_VISIBLE_DEVICES='1' python distributed.py --job_name=worker --task_index=1

CUDA_VISIBLE_DEVICES='0' python distributed.py --job_name=worker --task_index=2

CUDA_VISIBLE_DEVICES='1' python distributed.py --job_name=worker --task_index=3
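The five launch commands follow a pattern: task indices run in the order of the host lists that the script takes as flags, and each worker on a node gets the next free GPU on that node. Below is a minimal sketch of that mapping. The helper name `launch_commands` is hypothetical, and the rule "the n-th worker on a host uses GPU n" is an assumption inferred from the commands above, not something the original script enforces.

```python
from collections import defaultdict


def launch_commands(ps_hosts, worker_hosts, script="distributed.py"):
    """Return (host, command) pairs, one per task in the cluster.

    Hypothetical helper: it only formats strings and starts nothing.
    """
    commands = []
    for index, address in enumerate(ps_hosts.split(",")):
        host = address.split(":")[0]
        # An empty CUDA_VISIBLE_DEVICES hides every GPU, so this task uses the CPU.
        commands.append((host, "CUDA_VISIBLE_DEVICES='' python %s --job_name=ps --task_index=%d"
                         % (script, index)))

    gpus_used = defaultdict(int)  # how many workers have been placed on each host so far
    for index, address in enumerate(worker_hosts.split(",")):
        host = address.split(":")[0]
        gpu = gpus_used[host]  # assumption: the n-th worker on a host gets GPU n
        gpus_used[host] += 1
        commands.append((host, "CUDA_VISIBLE_DEVICES='%d' python %s --job_name=worker --task_index=%d"
                         % (gpu, script, index)))
    return commands


if __name__ == "__main__":
    for host, command in launch_commands(
            "172.16.1.182:2222",
            "172.16.1.183:2223,172.16.1.183:2224,172.16.1.187:2225,172.16.1.187:2226"):
        print(host, "->", command)
```

Each command still has to be run on the node whose address appears next to it; the sketch only shows which task index and GPU belong to which host.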

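Environment:

A Tesla K20c cluster with 3 nodes: 1 node runs the parameter server on a CPU, and the other 2 nodes each use 2 GPUs as workers, giving 4 worker tasks in total. Training can run in either synchronous or asynchronous mode.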
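To make the difference between the two modes concrete, here is a toy sketch in plain Python of one parameter shared by 4 workers. It is not the TensorFlow implementation and not the author's training loop. The loss, the data values, and the function names `sync_round` and `async_round` are all assumptions for illustration. Staleness in the asynchronous case is simplified to "every worker read the parameter at the start of the round".

```python
def gradient(w, x):
    """Gradient of the toy per-example loss 0.5 * (w - x) ** 2 with respect to w."""
    return w - x


def sync_round(w, shards, lr):
    """Synchronous mode: wait for all workers, average, apply one update."""
    grads = [gradient(w, x) for x in shards]  # every worker sees the same w
    return w - lr * sum(grads) / len(grads)


def async_round(w, shards, lr):
    """Asynchronous mode: each worker pushes its update as soon as it is ready."""
    stale_w = w  # value each worker read before any update in this round landed
    for x in shards:
        # The gradient is computed from the stale value but applied to the current one.
        w = w - lr * gradient(stale_w, x)
    return w


shards = [1.0, 2.0, 3.0, 4.0]  # one toy data point per worker
w_sync = w_async = 0.0
for _ in range(200):
    w_sync = sync_round(w_sync, shards, lr=0.1)
    w_async = async_round(w_async, shards, lr=0.1)
```

In this toy setup both variants approach the mean of the data points (2.5). The synchronous round applies one averaged step per round, while the asynchronous round applies four separate steps computed from an outdated parameter value. That is the usual trade-off: synchronous updates are consistent but wait for the slowest worker, while asynchronous updates keep every GPU busy at the cost of stale gradients. In TensorFlow 1.x, synchronous aggregation is commonly done with `tf.train.SyncReplicasOptimizer`; the script here is cut off before its training loop, so whether it uses that class is not shown.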

# encoding:utf-8
import math
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

flags = tf.app.flags

flags.DEFINE_string('data_dir', '/home/zhangzhaoyu/incubator-mxnet-master/example/image-classification/data', 'Directory for storing MNIST data')
flags.DEFINE_integer('hidden_units', 100, 'Number of units in the hidden layer of the NN')
flags.DEFINE_integer('train_steps', 100000, 'Number of training steps to perform')
flags.DEFINE_integer('batch_size', 100, 'Training batch size')
flags.DEFINE_float('learning_rate', 0.01, 'Learning rate')

flags.DEFINE_string('ps_hosts', '172.16.1.182:2222', 'Comma-separated list of hostname:port pairs')

flags.DEFINE_string('worker_hosts', '172.16.1.183:2223,172.16.1.183:2224,172.16.1.187:2225,172.16.1.187:2226',
                    'Comma-separated list of hostname:p
