Distributed model training with Horovod

This article shows how to use Horovod for distributed model training in machine learning and deep learning projects, improving training efficiency.

Using Horovod

Distributed training is a set of techniques for using many GPUs located on many different machines for training your machine learning models. Distributed training is an increasingly common and important deep learning technique, as it enables the training of models too big or too cumbersome to manage on a single machine.

In a previous article, “Distributed model training in PyTorch using DistributedDataParallel”, I covered distributed training in PyTorch using the native DistributedDataParallel API (if you are unfamiliar with distributed training in general, start there!). This follow-up post discusses distributed training using Uber’s Horovod library. Horovod is a cross-platform distributed training API (supports PyTorch, TensorFlow, Keras, and MXNet) designed for both ease-of-use and performance. Horovod has some really impressive integrations: for example, you can run it within Spark. And it boasts some pretty impressive results:

[Figure: benchmark scaling results from “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow”]

In this post, I will demonstrate distributed model training using Horovod. We will cover the basics of distributed training, then benchmark a real training script to see the library in action.

Click here to go to our examples GitHub repo so that you can follow along in code.

Note that this blog post assumes some familiarity with distributed training. While I provide a quick overview in the next section, if you are unfamiliar with the idea, I recommend first skimming the more detailed overview in my previous post, “Scaling model training in PyTorch using distributed data parallel”.

How it works

Horovod distributes training across GPUs using the data parallelization strategy. In data parallelization, each GPU in the job receives its own independent slice of the data batch, i.e. its own “batch slice”. Each GPU uses this data to independently calculate a gradient update. For example, if you were to use two GPUs and a batch size of 32, one GPU would handle forward and back propagation on the first 16 records, and the second, the last 16. These gradient updates are then synchronized among the GPUs, averaged together, and finally applied to the model.

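To make the batch slicing concrete, here is a minimal sketch of how each worker gets a disjoint slice, using Horovod’s PyTorch bindings together with PyTorch’s DistributedSampler (the toy dataset and batch size are stand-ins, not code from the examples repo):

```python
import torch
import horovod.torch as hvd

hvd.init()  # one process per GPU/worker

# A stand-in dataset of 1,024 records with 10 features each.
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10))

# DistributedSampler hands each of the hvd.size() workers a disjoint
# partition of the dataset, so no record is seen by two workers in
# the same epoch.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())

# With 2 workers and a per-worker batch size of 16, each training step
# consumes a global batch of 32: 16 records on one worker and a
# disjoint 16 on the other.
loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)
```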

Here’s how it works, step-by-step:

  1. Each worker maintains its own copy of the model weights and its own copy of the dataset.
  2. Upon receiving the go signal, each worker process draws a disjoint batch from the dataset and computes a gradient for that batch.
  3. The workers use the ring all-reduce algorithm to synchronize their individual gradients, computing the same average gradient on all nodes locally.
  4. Each worker applies the gradient update to its local copy of the model.
  5. The next batch of training begins (a minimal sketch of this loop follows the list).
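
Put together, a minimal PyTorch training script following these steps might look like the sketch below. This is an illustrative sketch, not the code from the examples repo: the model, dataset, and hyperparameters are stand-ins, and it assumes one GPU per worker process.

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

# Step 1: every worker builds its own copy of the model and dataset.
model = nn.Linear(10, 2).cuda()
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)

# Scaling the learning rate by the number of workers is the usual
# heuristic when the effective global batch size grows.
opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Steps 3-4: DistributedOptimizer all-reduces (averages) gradients
# across workers before each local weight update.
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())

# Ensure all workers start from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(opt, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the disjoint partitions each epoch
    for X, y in loader:       # Step 2: each worker draws a disjoint batch
        X, y = X.cuda(), y.cuda()
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()            # Step 5: on to the next batch
```

You would launch one copy of this script per GPU, e.g. `horovodrun -np 2 python train.py` for two GPUs on a single machine (the script name is assumed).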

The tricky part is maintaining consistency between the different copies of the model across the different GPUs. If the different model copies somehow end up with different weights, weight updates will become inconsistent and model training will diverge.

This is achieved using a technique borrowed from the high-performance computing world: the ring all-reduce algorithm. This diagram, taken from Horovod’s launch post, demonstrates how it works:

[Figure: ring all-reduce diagram from “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow”]
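
You can see the averaging behavior of all-reduce directly from Horovod’s tensor API. The following small sketch (mine, not from the launch post) prints the same averaged value on every worker when launched under horovodrun:

```python
import torch
import horovod.torch as hvd

hvd.init()

# Each worker contributes a tensor equal to its own rank. Ring
# all-reduce combines them so that every worker ends up with the
# same result; horovod.torch averages by default, so with 2 workers
# (ranks 0 and 1) every worker prints 0.5.
local = torch.tensor([float(hvd.rank())])
averaged = hvd.allreduce(local, name="demo")
print(f"rank {hvd.rank()}: {averaged.item()}")
```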