High performance models in TensorFlow

Reposted from: http://www.tuicool.com/articles/6jAnyiy

Original: https://www.tensorflow.org/performance/performance_models


This document and accompanying scripts detail how to build highly scalable models that target a variety of system types and network topologies. The techniques in this document utilize some low-level TensorFlow Python primitives. In the future, many of these techniques will be incorporated into high-level APIs.


Input Pipeline

The Performance Guide explains how to identify possible input pipeline issues and best practices. We found that tf.FIFOQueue and tf.train.queue_runner could not saturate multiple current-generation GPUs when feeding large inputs at high samples per second, such as training ImageNet with AlexNet. This is because the underlying implementation uses Python threads, whose overhead is too large.


Another approach, which we have implemented in the scripts, is to build an input pipeline using the native parallelism in TensorFlow. Our implementation is made up of 3 stages:

  • I/O reads: Choose and read image files from disk.
  • Image Processing: Decode image records into images, preprocess, and organize into mini-batches.
  • CPU-to-GPU Data Transfer: Transfer images from CPU to GPU.


The dominant part of each stage is executed in parallel with the other stages using data_flow_ops.StagingArea. StagingArea is a queue-like operator similar to tf.FIFOQueue. The difference is that StagingArea offers simpler functionality and can be executed on both CPU and GPU in parallel with other stages. Breaking the input pipeline into 3 stages that operate independently in parallel is scalable and takes full advantage of large multi-core environments. The rest of this section details the stages, followed by details about using data_flow_ops.StagingArea.


Parallelize I/O Reads

data_flow_ops.RecordInput is used to parallelize reading from disk. Given a list of input files representing TFRecords, RecordInput continuously reads records using background threads. The records are placed into the op's own large internal pool, and once it has loaded at least half of its capacity, it produces output tensors.

This op's internal threads are dominated by I/O time and consume minimal CPU, which allows it to run smoothly in parallel with the rest of the model.

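Below is a minimal sketch of how RecordInput might be wired up; the file pattern, parallelism, buffer size, and batch size are illustrative assumptions rather than the benchmark script's exact settings.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# Read TFRecord shards with background threads (values are illustrative).
record_input = data_flow_ops.RecordInput(
    file_pattern='/data/imagenet/train-*',  # assumed path to TFRecord shards
    parallelism=64,                          # background reader threads
    buffer_size=10000,                       # internal record pool
    batch_size=256,
    name='record_input')

# get_yield_op() emits a 1-D string tensor holding batch_size serialized records.
records = record_input.get_yield_op()
# Split into individual scalar record tensors for per-image preprocessing.
records = tf.split(records, num_or_size_splits=256, axis=0)
records = [tf.reshape(record, []) for record in records]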

Parallelize Image Processing

After images are read from RecordInput they are passed as tensors to the image processing pipeline. To make the image processing pipeline easier to explain, assume that the input pipeline is targeting 8 GPUs with a batch size of 256 (32 per GPU).

256 records are read and processed individually in parallel. This starts with 256 independent RecordInput read ops in the graph. Each read op is followed by an identical set of ops for image preprocessing that are considered independent and executed in parallel. The image preprocessing ops include operations such as image decoding, distortion, and resizing.


Once the images are through preprocessing, they are concatenated together into 8 tensors, each with a batch size of 32. Rather than using tf.concat for this purpose, which is implemented as a single op that waits for all the inputs to be ready before concatenating them together, tf.parallel_stack is used. tf.parallel_stack allocates an uninitialized tensor as an output, and each input tensor is written to its designated portion of the output tensor as soon as the input is available.

When all the input tensors are finished, the output tensor is passed along in the graph. This effectively hides all the memory latency with the long tail of producing all the input tensors.

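As an illustration, the sketch below batches 32 preprocessed images for one GPU with tf.parallel_stack; the random tensors merely stand in for the outputs of the decode/distort/resize ops above.

import tensorflow as tf

# Hypothetical stand-ins for the 32 preprocessed images headed to one GPU.
processed_images = [tf.random_uniform([224, 224, 3]) for _ in range(32)]

# tf.parallel_stack writes each input into its slice of the output tensor as
# soon as that input is ready, rather than waiting for every input as
# tf.concat (or tf.stack) would. Input shapes must be statically known.
per_gpu_batch = tf.parallel_stack(processed_images)  # shape [32, 224, 224, 3]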

Parallelize CPU-to-GPU Data Transfer

Continuing with the assumption that the target is 8 GPUs with a batch size of 256 (32 per GPU): once the input images are processed and concatenated together by the CPU, we have 8 tensors, each with a batch size of 32.

TensorFlow enables tensors from one device to be used on any other device directly. TensorFlow inserts implicit copies to make the tensors available on any devices where they are used. The runtime schedules the copy between devices to run before the tensors are actually used. However, if the copy cannot finish in time, the computation that needs those tensors will stall and result in decreased performance.

In this implementation, data_flow_ops.StagingArea is used to explicitly schedule the copy in parallel. The end result is that when computation starts on the GPU, all the tensors are already available.

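The sketch below stages one per-GPU batch onto a GPU with data_flow_ops.StagingArea; the tensor shapes, device strings, and random inputs are illustrative assumptions.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# Hypothetical per-GPU batch produced on the CPU by the preprocessing stage.
with tf.device('/cpu:0'):
  images = tf.random_uniform([32, 224, 224, 3])
  labels = tf.random_uniform([32], maxval=1000, dtype=tf.int32)

# Stage the batch on the GPU. The put() runs in parallel with the other
# stages, so by the time get() is consumed the data is already on the GPU.
with tf.device('/gpu:0'):
  gpu_stage = data_flow_ops.StagingArea(
      dtypes=[tf.float32, tf.int32],
      shapes=[[32, 224, 224, 3], [32]])
  stage_put_op = gpu_stage.put([images, labels])
  gpu_images, gpu_labels = gpu_stage.get()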

Software Pipelining

With all the stages capable of being driven by different processors, data_flow_ops.StagingArea is used between them so they run in parallel. StagingArea is a queue-like operator similar to tf.FIFOQueue that offers simpler functionality and can be executed on both CPU and GPU.

Before the model starts running all the stages, the input pipeline stages are warmed up to prime the staging buffers in between with one set of data. During each run step, one set of data is read from the staging buffers at the beginning of each stage, and one set is pushed at the end.


For example, suppose there are three stages, A, B, and C, with two staging areas in between, S1 and S2. During the warm up, we run:


Warm up:
Step 1: A0
Step 2: A1  B0

Actual execution:
Step 3: A2  B1  C0
Step 4: A3  B2  C1
Step 5: A4  B3  C2

After the warm up, S1 and S2 each have one set of data in them. For each step of the actual execution, one set of data is consumed from each staging area, and one set is added to each.

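The session loop below is a minimal sketch of driving the warm up and the steady state; stage_put_ops (the collected put() ops of all staging areas, ordered by stage) and train_op are hypothetical handles to ops built elsewhere in the graph.

import tensorflow as tf

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())

  # Warm up: run growing prefixes of the puts so that, as in the diagram
  # above, stage A fills S1 before stage B starts consuming from it.
  for i in range(len(stage_put_ops)):
    sess.run(stage_put_ops[:i + 1])

  # Steady state: each step consumes one set of data from every staging area
  # (through the get() ops feeding train_op) and pushes the next set in.
  for _ in range(1000):
    sess.run([train_op, stage_put_ops])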

Benefits of using this scheme:

  • All stages are non-blocking, since the staging areas always have one set of data after the warm up.
  • Each stage can run in parallel since they can all start immediately.
  • The staging buffers have a fixed memory overhead. They will have at most one extra set of data.
  • Only a single session.run() call is needed to run all stages of the step, which makes profiling and debugging much easier.


Best Practices in Building High-Performance Models

Collected below are a couple of additional best practices that can improve performance and increase the flexibility of models.


Build the model with both NHWC and NCHW

Most TensorFlow operations used by a CNN support both NHWC and NCHW data format. On GPU, NCHW is faster. But on CPU, NHWC is sometimes faster.

Building a model to support both data formats keeps the model flexible and capable of operating optimally regardless of platform. The benchmark script was written to support both NCHW and NHWC. NCHW should always be used when training with GPUs. NHWC is sometimes faster on CPU. A flexible model can be trained on GPUs using NCHW, with inference done on CPU using NHWC and the weights obtained from training.

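As an illustration, the sketch below builds a convolution that accepts either layout and shares one set of weights between an NCHW GPU graph and an NHWC CPU graph; the shapes and variable names are assumptions, not the benchmark script's code.

import tensorflow as tf

def conv3x3(x, filters, data_format):
  # x is NHWC when data_format='NHWC' and NCHW when data_format='NCHW';
  # filters is a [3, 3, in_channels, out_channels] variable shared by both.
  return tf.nn.conv2d(x, filters, strides=[1, 1, 1, 1],
                      padding='SAME', data_format=data_format)

# Train on GPU with channels-first, run inference on CPU with channels-last,
# reusing the same filter weights.
weights = tf.get_variable('conv1_w', [3, 3, 3, 64])
gpu_out = conv3x3(tf.random_uniform([32, 3, 224, 224]), weights, 'NCHW')
cpu_out = conv3x3(tf.random_uniform([32, 224, 224, 3]), weights, 'NHWC')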

Use Fused Batch-Normalization

The default batch-normalization in TensorFlow is implemented as composite operations. This is very general, but often leads to suboptimal performance. An alternative is to use fused batch-normalization, which often has much better performance on GPU. Below is an example of using tf.contrib.layers.batch_norm to implement fused batch-normalization.


bn = tf.contrib.layers.batch_norm(
          input_layer, fused=True, data_format='NCHW',
          scope=scope)

Variable Distribution and Gradient Aggregation

During training, training variable values are updated using aggregated gradients and deltas. In the benchmark script, we demonstrate that with the flexible and general-purpose TensorFlow primitives, a diverse range of high-performance distribution and aggregation schemes can be built.

Three examples of variable distribution and aggregation were included in the script:


  • parameter_server where each replica of the training model reads the variables from a parameter server and updates the variable independently. When each model needs the variables, they are copied over through the standard implicit copies added by the TensorFlow runtime. The example script illustrates using this method for local training, distributed synchronous training, and distributed asynchronous training.
  • replicated places an identical copy of each training variable on each GPU. The forward and backward computation can start immediately as the variable data is immediately available. Gradients are accumulated across all GPUs, and the aggregated total is applied to each GPU's copy of the variables to keep them in sync.
  • distributed_replicated places an identical copy of the training parameters on each GPU along with a master copy on the parameter servers. The forward and backward computation can start immediately as the variable data is immediately available. Gradients are accumulated across all GPUs on each server and then the per-server aggregated gradients are applied to the master copy. After all workers do this, each worker updates its copy of the variable from the master copy.

Below are additional details about each approach.


Parameter Server Variables

The most common way trainable variables are managed in TensorFlow models is parameter server mode.

In a distributed system, each worker process runs the same model, and parameter server processes own the master copies of the variables. When a worker needs a variable from a parameter server, it refers to it directly. The TensorFlow runtime adds implicit copies to the graph to make the variable value available on the computation device that needs it. When a gradient is computed on a worker, it is sent to the parameter server that owns the particular variable, and the corresponding optimizer is used to update the variable.

There are some techniques to improve throughput:


  • The variables are spread among parameter servers based on their size, for load balancing.
  • When each worker has multiple GPUs, gradients are accumulated across the GPUs and a single aggregated gradient is sent to the parameter server. This reduces the network bandwidth and the amount of work done by the parameter servers.
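
A minimal sketch of parameter-server variable placement using tf.train.replica_device_setter follows; the cluster layout and the worker task index are assumptions for illustration, and the by-size load balancing mentioned above is not shown.

import tensorflow as tf

# Hypothetical two-PS / two-worker cluster.
cluster = tf.train.ClusterSpec({
    'ps': ['10.0.0.1:50000', '10.0.0.2:50000'],
    'worker': ['10.0.0.1:50001', '10.0.0.2:50001'],
})

# Variables created in this scope are assigned to the PS tasks, while the
# compute ops stay on this worker; the TensorFlow runtime then adds the
# implicit copies described above when the worker needs the values.
with tf.device(tf.train.replica_device_setter(
    worker_device='/job:worker/task:0', cluster=cluster)):
  weights = tf.get_variable('fc_w', [1024, 1000])
  biases = tf.get_variable('fc_b', [1000])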

For coordinating between workers, a very common mode is async updates, where each worker updates the master copy of the variables without synchronizing with other workers. In our model, we demonstrate that it is fairly easy to introduce synchronization across workers so updates for all workers are finished in one step before the next step can start.

The parameter server method can also be used for local training. In this case, instead of spreading the master copies of variables across parameter servers, they are either on the CPU or spread across the available GPUs.

Due to the simple nature of this setup, this architecture has gained a lot of popularity within the community.

This mode can be used in the script by passing --variable_update=parameter_server.


Replicated Variables

In this design, each GPU on the server has its own copy of each variable. The values are kept in sync across GPUs by applying the fully aggregated gradient to each GPU's copy of the variable.

The variables and data are available at the start of training, so the forward pass of training can start immediately. Gradients are aggregated across the devices and the fully aggregated gradient is then applied to each local copy.

Gradient aggregation across the server can be done in different ways:

  • Using standard TensorFlow operations to accumulate the total on a single device (CPU or GPU) and then copy it back to all GPUs.
  • Using NVIDIA® NCCL, described below in the NCCL section.

This mode can be used in the script by passing --variable_update=replicated.

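The sketch below shows the standard-ops option from the list above: sum the tower gradients on a single device and apply the same aggregated gradient to every GPU's copy of the variable. tower_grads_and_vars and optimizer are hypothetical names for structures built elsewhere.

import tensorflow as tf

def apply_replicated_gradients(optimizer, tower_grads_and_vars):
  # tower_grads_and_vars[i] is the (gradient, variable) list from GPU i;
  # every GPU holds its own copy of each variable, listed in the same order.
  train_ops = []
  for grads_and_vars in zip(*tower_grads_and_vars):
    grads = [g for g, _ in grads_and_vars]
    with tf.device('/cpu:0'):
      # Accumulate the total gradient for this variable on one device.
      aggregated = tf.add_n(grads)
    # Apply the same aggregated gradient to each GPU's local copy so the
    # copies stay in sync.
    for _, var in grads_and_vars:
      with tf.device(var.device):
        train_ops.append(optimizer.apply_gradients([(aggregated, var)]))
  return tf.group(*train_ops)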

Replicated Variables in Distributed Training

The replicated method for variables can be extended to distributed training. One way to do this is, like the replicated mode, to aggregate the gradients fully across the cluster and apply them to each local copy of the variable. This may be shown in a future version of the scripts; the scripts currently present a different variation, described here.

In this mode, in addition to each GPU's copy of the variables, a master copy is stored on the parameter servers. As with the replicated mode, training can start immediately using the local copies of the variables.


As the gradients of the weights become available, they are sent back to the parameter servers and all local copies are updated, as sketched after the list below:

  1. All the gradients from the GPUs on the same worker are aggregated together.
  2. Aggregated gradients from each worker are sent to the parameter server that owns the variable, where the specified optimizer is used to update the master copy of the variable.
  3. Each worker updates its local copy of the variable from the master. In the example model, this is done with a cross-replica barrier that waits for all the workers to finish updating the variables, and fetches the new variable only after the barrier has been released by all replicas. Once the copy finishes for all variables, this marks the end of a training step, and a new step can start.

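A condensed sketch of these three steps follows; tower_grads, master_vars, local_vars, and optimizer are hypothetical names, and the cross-replica barrier is only indicated by a comment.

import tensorflow as tf

# 1. Aggregate the gradients from this worker's GPUs, one sum per variable.
#    tower_grads[i] is the per-variable gradient list computed on GPU i.
local_agg_grads = [tf.add_n(per_var) for per_var in zip(*tower_grads)]

# 2. Apply this worker's aggregated gradients to the master copies that live
#    on the parameter servers.
apply_op = optimizer.apply_gradients(list(zip(local_agg_grads, master_vars)))

# 3. After every worker has applied its gradients (enforced elsewhere by a
#    cross-replica barrier), refresh each local copy from the master copy.
with tf.control_dependencies([apply_op]):
  sync_op = tf.group(*[local_v.assign(master_v)
                       for local_v, master_v in zip(local_vars, master_vars)])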

Although this sounds similar to the standard use of parameter servers, the performance is often better in many cases. This is largely due to the fact that the computation can happen without any delay, and much of the copy latency of early gradients can be hidden by later computation layers.

This mode can be used in the script by passing --variable_update=distributed_replicated.


NCCL

In order to broadcast variables and aggregate gradients across different GPUs within the same host machine, we can use the default TensorFlow implicit copy mechanism.

However, we can instead use the optional NCCL (tf.contrib.nccl) support. NCCL is an NVIDIA® library that can efficiently broadcast and aggregate data across different GPUs. It schedules a cooperating kernel on each GPU that knows how to best utilize the underlying hardware topology; this kernel uses a single SM of the GPU.

In our experiment, we demonstrate that although NCCL often leads to much faster data aggregation by itself, it doesn't necessarily lead to faster training. Our hypothesis is that the implicit copies are essentially free, since they go to the copy engine on the GPU, as long as their latency can be hidden by the main computation itself. Although NCCL can transfer data faster, it takes one SM away and adds more pressure to the underlying L2 cache. Our results show that for 8 GPUs, NCCL often leads to better performance. However, for fewer GPUs, the implicit copies often perform better.

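A minimal sketch of summing the same gradient across 8 GPUs with tf.contrib.nccl; the random tensors are placeholders for real per-GPU gradients.

import tensorflow as tf
from tensorflow.contrib import nccl

# Hypothetical stand-ins for the same gradient computed on each of 8 GPUs.
per_gpu_grads = []
for i in range(8):
  with tf.device('/gpu:%d' % i):
    per_gpu_grads.append(tf.random_uniform([1024, 1000]))

# nccl.all_sum returns one tensor per input device, each holding the sum of
# all inputs, computed with NCCL's all-reduce instead of implicit copies.
summed_grads = nccl.all_sum(per_gpu_grads)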

Staged Variables

We further introduce a staged-variable mode where we use staging areas for both the variable reads, and their updates. Similar to software pipelining of the input pipeline, this can hide the data copy latency. If the computation time takes longer than the copy and aggregation, the copy itself becomes essentially free.

The downside is that all the weights read are from the previous training step. So it is a different algorithm from SGD. But it is possible to improve its convergence by adjusting learning rate and other hyperparameters.

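A rough sketch of routing variable reads through a StagingArea so that compute uses values staged in the previous step; the two variables here stand in for a real model's trainable variables.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# Hypothetical model variables.
variables = [tf.get_variable('w', [1024, 1000]),
             tf.get_variable('b', [1000])]

# put() stages the current values; get() hands back the values staged in the
# previous step (after a one-step warm up). This hides the copy latency, but
# it also means the weights read are one step stale compared to plain SGD.
var_stage = data_flow_ops.StagingArea(
    dtypes=[v.dtype.base_dtype for v in variables],
    shapes=[v.shape for v in variables])
put_vars_op = var_stage.put([v.read_value() for v in variables])
staged_vars = var_stage.get()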

Executing the script

This section lists the core command line arguments and a few basic examples for executing the main script (tf_cnn_benchmarks.py).


Base command line arguments
  • model: Model to use, e.g. resnet50, inception3, vgg16, and alexnet.
  • num_gpus: Number of GPUs to use.
  • data_dir: Path to data to process. If not set, synthetic data is used. To use ImageNet data, use these instructions (https://github.com/tensorflow/tensorflow/blob/master/tensorflow_models/inception#getting-started) as a starting point.
  • batch_size: Batch size for each GPU.
  • variable_update: The method for managing variables: parameter_server, replicated, distributed_replicated, or independent.
  • local_parameter_device: Device to use as parameter server: cpu or gpu.
Single instance examples
# VGG16 training ImageNet with 8 GPUs using arguments that optimize for
# Google Compute Engine.
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=8 \
--batch_size=32 --model=vgg16 --data_dir=/home/ubuntu/imagenet/train \
--variable_update=parameter_server --nodistortions

# VGG16 training synthetic ImageNet data with 8 GPUs using arguments that
# optimize for the NVIDIA DGX-1.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=vgg16 --variable_update=replicated --use_nccl=True

# VGG16 training ImageNet data with 8 GPUs using arguments that optimize for
# Amazon EC2.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=vgg16 --variable_update=parameter_server

# ResNet-50 training ImageNet data with 8 GPUs using arguments that optimize for
# Amazon EC2.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=replicated --use_nccl=False
Distributed command line arguments
  • ps_hosts: Comma-separated list of hosts to use as parameter servers in the format of <host>:port, e.g. 10.0.0.2:50000.
  • worker_hosts: Comma-separated list of hosts to use as workers in the format of <host>:port, e.g. 10.0.0.2:50001.
  • task_index: Index of the host in the list of ps_hosts or worker_hosts being started.
  • job_name: Type of job, e.g. ps or worker.
Distributed examples

Below is an example of training ResNet-50 on 2 hosts: host_0 (10.0.0.1) and host_1 (10.0.0.2). The example uses synthetic data. To use real data, pass the --data_dir argument.

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1
