Official | TensorFlow 2.0 Distributed Training Tutorial


Overview

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.

tf.distribute.Strategy has been designed with these key goals in mind:

  • Easy to use and support multiple user segments, including researchers, ML engineers, etc.

  • Provide good performance out of the box.

  • Easy switching between strategies.

tf.distribute.Strategy can be used with a high-level API like Keras, and can also be used to distribute custom training loops (and, in general, any computation using TensorFlow).

In TensorFlow 2.0, you can execute your programs eagerly, or in a graph using tf.function. tf.distribute.Strategy intends to support both these modes of execution. Although we discuss training most of the time in this guide, this API can also be used for distributing evaluation and prediction on different platforms.

You can use tf.distribute.Strategy with very few changes to your code, because we have changed the underlying components of TensorFlow to be strategy-aware. This includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.

In this guide, we explain the various types of strategies and how you can use them in different situations.

Types of strategies

tf.distribute.Strategy intends to cover a number of use cases along different axes. Some of these combinations are currently supported and others will be added in the future. Some of these axes are:

  • Synchronous vs asynchronous training: These are two common ways of distributing training with data parallelism. In sync training, all workers train over different slices of input data in sync, and aggregate gradients at each step. In async training, all workers are independently training over the input data and updating variables asynchronously. Typically sync training is supported via all-reduce and async through parameter server architecture.

  • Hardware platform: You may want to scale your training onto multiple GPUs on one machine, or multiple machines in a network (with 0 or more GPUs each), or on Cloud TPUs.

In order to support these use cases, there are five strategies available. In the next section we explain which of these are supported in which scenarios in TF 2.0 at this time. Here is a quick overview:

| Training API | MirroredStrategy | TPUStrategy | MultiWorkerMirroredStrategy | CentralStorageStrategy | ParameterServerStrategy | OneDeviceStrategy |
| Keras API | Supported | Experimental support | Experimental support | Experimental support | Support planned post 2.0 | Supported |
| Custom training loop | Experimental support | Experimental support | Support planned post 2.0 | Support planned post 2.0 | No support yet | Supported |
| Estimator API | Limited Support | Not supported | Limited Support | Limited Support | Limited Support | Limited Support |
Note:  Estimator support is limited. Basic training and evaluation are experimental, and advanced features—such as scaffold—are not implemented. We recommend using Keras or custom training loops if a use case is not covered.

MirroredStrategy

tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.

Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up, and makes them available on each device. It’s a fused algorithm that is very efficient and can reduce the overhead of synchronization significantly. There are many all-reduce algorithms and implementations available, depending on the type of communication available between devices. By default, it uses NVIDIA NCCL as the all-reduce implementation. You can choose from a few other options we provide, or write your own.
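
To make the all-reduce semantics concrete, here is a toy, framework-free illustration in plain Python (not an actual NCCL call); the per-device values are made up:

# Each device contributes a tensor; the tensors are summed element-wise,
# and every device ends up holding the same result.
per_device_grads = {
    "gpu:0": [1.0, 2.0],   # hypothetical gradient shard computed on GPU 0
    "gpu:1": [3.0, 4.0],   # hypothetical gradient shard computed on GPU 1
}

reduced = [sum(values) for values in zip(*per_device_grads.values())]  # [4.0, 6.0]

# After the all-reduce, each device applies the identical reduced tensor,
# which is what keeps the mirrored variables in sync.
result_on_each_device = {device: reduced for device in per_device_grads}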

Here is the simplest way of creating MirroredStrategy:

mirrored_strategy = tf.distribute.MirroredStrategy()

This will create a MirroredStrategy instance which will use all the GPUs that are visible to TensorFlow, and use NCCL as the cross device communication.
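
To see the mirroring described above, you can create a variable inside the strategy's scope. A minimal sketch (the variable's name and value are arbitrary; on a machine without GPUs the strategy simply falls back to the CPU):

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored across all replicas
# and kept in sync by applying identical updates to each copy.
with mirrored_strategy.scope():
    v = tf.Variable(1.0, name="mirrored_example")

print(v)  # on a multi-GPU machine this shows a MirroredVariable with one component per device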

If you wish to use only some of the GPUs on your machine, you can do so like this:

mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:1,/job:localhost/replica:0/task:0/device:GPU:0

If you wish to override the cross device communication, you can do so using the cross_device_ops argument by supplying an instance of tf.distribute.CrossDeviceOps. Currently, tf.distribute.HierarchicalCopyAllReduce and tf.distribute.ReductionToOneDevice are two options other than tf.distribute.NcclAllReduce which is the default.

mirrored_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

CentralStorageStrategy

tf.distribute.experimental.CentralStorageStrategy does synchronous training as well. Variables are not mirrored; instead, they are placed on the CPU, and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations will be placed on that GPU.

Create an instance of CentralStorageStrategy by:

central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
INFO:tensorflow:ParameterServerStrategy with compute_devices = ('/device:GPU:0',), variable_device = '/device:GPU:0'

This will create a CentralStorageStrategy instance which will use all visible GPUs and the CPU. Updates to variables on replicas will be aggregated before being applied to variables.

Note:  This strategy is experimental as we are currently improving it and making it work for more scenarios. As part of this, please expect the APIs to change in the future.

MultiWorkerMirroredStrategy

tf.distribute.experimental.MultiWorkerMirroredStrategy is very similar to MirroredStrategy. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to MirroredStrategy, it creates copies of all variables in the model on each device across all workers.

It uses CollectiveOps as the multi-worker all-reduce communication method used to keep variables in sync. A collective op is a single op in the TensorFlow graph which can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology and tensor sizes.

It also implements additional performance optimizations. For example, it includes a static optimization that converts multiple all-reductions on small tensors into fewer all-reductions on larger tensors. In addition, we are designing it to have a plugin architecture - so that in the future, you will be able to plug in algorithms that are better tuned for your hardware. Note that collective ops also implement other collective operations such as broadcast and all-gather.

Here is the simplest way of creating MultiWorkerMirroredStrategy:

multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Single-worker CollectiveAllReduceStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.AUTO

MultiWorkerMirroredStrategy currently allows you to choose between two different implementations of collective ops. CollectiveCommunication.RING implements ring-based collectives using gRPC as the communication layer. CollectiveCommunication.NCCL uses Nvidia's NCCL to implement collectives. CollectiveCommunication.AUTO defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster. You can specify them in the following way:

multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Single-worker CollectiveAllReduceStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.NCCL

One of the key differences to get multi-worker training going, as compared to multi-GPU training, is the multi-worker setup. The TF_CONFIG environment variable is the standard way in TensorFlow to specify the cluster configuration to each worker that is part of the cluster. Learn more about setting up TF_CONFIG.
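
For reference, TF_CONFIG is a JSON string that names the cluster members and the current process's role. A minimal sketch for a two-worker cluster (the host names and ports below are placeholders):

import json
import os

# Hypothetical two-worker cluster; replace the host:port entries with real addresses.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:23456"]
    },
    # This process is worker 1; the other process would set "index": 0.
    "task": {"type": "worker", "index": 1}
})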

Note:  This strategy is experimental as we are currently improving it and making it work for more scenarios. As part of this, please expect the APIs to change in the future.

TPUStrategy

tf.distribute.experimental.TPUStrategy lets you run your TensorFlow training on Tensor Processing Units (TPUs). TPUs are Google's specialized ASICs designed to dramatically accelerate machine learning workloads. They are available on Google Colab, the TensorFlow Research Cloud and Cloud TPU.

In terms of distributed training architecture, TPUStrategy is the same as MirroredStrategy - it implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in TPUStrategy.

Here is how you would instantiate TPUStrategy:

Note:  To run this code in Colab, you should select TPU as the Colab runtime. We will have a tutorial soon that will demonstrate how you can use TPUStrategy.
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu=tpu_address)
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
tpu_strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)

The TPUClusterResolver instance helps locate the TPUs. In Colab, you don't need to specify any arguments to it.

If you want to use this for Cloud TPUs:

  • You must specify the name of your TPU resource in the tpu argument.

  • You must initialize the tpu system explicitly at the start of the program. This is required before TPUs can be used for computation. Initializing the tpu system also wipes out the TPU memory, so it's important to complete this step first in order to avoid losing state.

Note:  This strategy is experimental as we are currently improving it and making it work for more scenarios. As part of this, please expect the APIs to change in the future.

ParameterServerStrategy

tf.distribute.experimental.ParameterServerStrategy supports parameter server training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.

In terms of code, it looks similar to other strategies:

ps_strategy = tf.distribute.experimental.ParameterServerStrategy()

For multi-worker training, TF_CONFIG needs to specify the configuration of parameter servers and workers in your cluster, which you can read more about in the TF_CONFIG section below.
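
A sketch of what that TF_CONFIG might look like for a cluster with two workers and one parameter server (again, the host names and ports are placeholders):

import json
import os

# Hypothetical cluster with two workers and one parameter server.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:23456"],
        "ps": ["ps0.example.com:34567"]
    },
    # On a parameter server process this would be {"type": "ps", "index": 0}.
    "task": {"type": "worker", "index": 0}
})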

OneDeviceStrategy

tf.distribute.OneDeviceStrategy runs on a single device. This strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device. Moreover, any functions called via strategy.experimental_run_v2 will also be placed on the specified device.

You can use this strategy to test your code before switching to other strategies which actually distribute your training to multiple devices/machines.

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
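
For example, a function run via strategy.experimental_run_v2 executes on the device given above. A minimal sketch (the function and its input here are just illustrative):

import tensorflow as tf

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

def step_fn(x):
    # This computation is placed on the device specified by the strategy.
    return x * 2.0

result = strategy.experimental_run_v2(step_fn, args=(tf.constant(3.0),))
print(result)  # a tensor with value 6.0, placed on the specified device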

So far we've talked about the different strategies available and how you can instantiate them. In the next few sections, we will talk about the different ways in which you can use them to distribute your training. We will show short code snippets in this guide and link off to full tutorials which you can run end to end.

Using tf.distribute.Strategy with Keras

We've integrated tf.distribute.Strategy into tf.keras, which is TensorFlow's implementation of the Keras API specification. tf.keras is a high-level API to build and train models. By integrating into the tf.keras backend, we've made it seamless for you to distribute training written in the Keras training framework.

Here's what you need to change in your code:

  1. Create an instance of the appropriate tf.distribute.Strategy

  2. Move the creation and compiling of Keras model inside strategy.scope.

We support all types of Keras models - sequential, functional and subclassed.

Here is a snippet of code to do this for a very simple Keras model with one dense layer:
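
A minimal sketch along those lines, using MirroredStrategy (any of the strategies described above could be used instead):

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

# Build and compile the model inside the strategy's scope so that its variables
# are created as mirrored variables.
with mirrored_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')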

