Introduction to Parallel Training
By parallelization strategy, distributed training generally falls into two categories, data parallelism and model parallelism; hybrid schemes combining the two also exist.
- Model parallelism: different GPUs in the distributed system are responsible for different parts of the network. For example, different layers of a neural network are assigned to different GPUs, or different parameters within a single layer are assigned to different GPUs;
- Data parallelism: each GPU holds a replica of the same model and is fed a different slice of the data; the results computed on all GPUs are then combined in some way.
Because the pieces of a model-parallel setup depend on one another, it scales poorly (you cannot simply keep adding GPUs), so it is used less in practice. In data parallelism the pieces are independent, scaling is good, and the speedup is better, so it is the more common choice in real training.
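To make model parallelism concrete, here is a minimal PyTorch sketch (the class name and the two-GPU split are illustrative, not from the original text; it requires two visible GPUs to run) that places two halves of a network on different devices and copies activations between them:

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model parallelism: the first half lives on cuda:0, the second on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 512).to('cuda:0')
        self.part2 = nn.Linear(512, 10).to('cuda:1')

    def forward(self, x):
        x = torch.relu(self.part1(x.to('cuda:0')))
        # The activation must be copied across GPUs; this transfer is exactly the
        # inter-part dependency that limits how freely model parallelism scales.
        return self.part2(x.to('cuda:1'))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))  # output tensor lives on cuda:1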
Data parallelism requires synchronizing model parameters across the GPUs, usually via synchronous or asynchronous updates. A synchronous update waits until every GPU has finished computing its gradient, computes the new weights once, and only after all GPUs have synchronized the new values does the next iteration begin. With asynchronous updates, each GPU updates the global weights as soon as its own gradient is ready, without waiting for the others (sometimes you can configure how many gradients to wait for), then synchronizes those weights and moves on to the next iteration. Synchronous updates involve waiting; asynchronous updates mostly avoid it, but they introduce thornier problems such as stale gradients. The sketch below contrasts the two control flows.
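A toy single-process sketch of the two update styles (everything here, including the fake gradient function and the worker count, is invented for illustration):

import numpy as np

def fake_gradient(weights, rng):
    # Stand-in for a real backward pass on one GPU's mini-batch.
    return weights * 0.1 + rng.standard_normal(weights.shape) * 0.01

rng = np.random.default_rng(0)
weights = np.ones(4)
lr = 0.5

# Synchronous update: a barrier collects every worker's gradient,
# then one averaged step is applied and all workers share the result.
grads = [fake_gradient(weights, rng) for _ in range(4)]  # wait for all 4 GPUs
weights -= lr * np.mean(grads, axis=0)

# Asynchronous update: each worker applies its gradient as soon as it is
# ready; later workers compute against weights that may already be stale.
for _ in range(4):
    weights -= lr * fake_gradient(weights, rng)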
In practice, synchronous data parallelism on a single multi-GPU machine is the most common setup; the configuration you see most often in papers is a single machine with eight GPUs. When there is more data than that can handle, multi-machine multi-GPU training becomes necessary.
TensorFlow Clusters
TensorFlow's cluster architecture is the Parameter Server architecture: alongside the compute units, additional servers called parameter servers are added. On every training step, each compute unit sends its gradients to the parameter server; the server aggregates them, computes the average, and returns it to each compute unit, keeping all compute units synchronized.
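To make the parameter-server data flow concrete, here is a hedged single-process sketch (the ParameterServer class and the simulated gradients are invented for illustration; a real deployment runs servers and workers as separate processes):

import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def update(self, grads):
        # Aggregate gradients from all compute units and apply the average.
        self.weights -= self.lr * np.mean(grads, axis=0)
        return self.weights  # the new weights go back to every compute unit

# One training step with three simulated compute units:
ps = ParameterServer(dim=4)
local_grads = [np.random.randn(4) for _ in range(3)]  # gradient from each unit
new_weights = ps.update(local_grads)                  # every unit is now synchronized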
Because TensorFlow clusters are so unwieldy, the industry has kept experimenting with alternative schemes. The main drawbacks:
- Many new concepts and a steep learning curve
- Substantial code changes required
- Different scripts must be launched on multiple machines
- The right ratio of PS to Worker nodes is hard to choose
- Considerable performance loss
In 2017, Facebook published "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", demonstrating the efficiency of large-scale data parallelism; the same year, Baidu published "Bringing HPC Techniques to Deep Learning", validating the feasibility of a new algorithm for gradient synchronization and weight updates. Inspired by these two papers, Uber developed the Horovod cluster scheme.
ring-allreduce
Baidu's 2017 paper "Bringing HPC Techniques to Deep Learning" adopted a new gradient- and weight-synchronization algorithm called ring-allreduce. In this algorithm every node communicates only with its two neighbors, and no parameter server is needed; as a result, every node participates in both computation and storage.
ring-allreduce arranges the compute units in a ring. To average gradients, each unit first splits its own gradient into N chunks and passes chunks to the next unit in the ring. With N nodes, a scatter-reduce phase of N−1 sends leaves each node holding one fully reduced chunk, and an allgather phase of another N−1 sends gives every node the complete result. The algorithm has been shown to be bandwidth-optimal.
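The sketch below simulates both phases of ring-allreduce in a single process with numpy (an illustrative toy, not Horovod's actual implementation):

import numpy as np

def ring_allreduce(node_data):
    """Toy single-process simulation of ring-allreduce over N 'nodes'."""
    n = len(node_data)
    # Each node splits its own gradient vector into n chunks.
    chunks = [np.array_split(d.astype(float), n) for d in node_data]

    # Phase 1 (scatter-reduce): in step s, node i sends chunk (i - s) % n to its
    # right neighbor, which accumulates it. After n-1 steps, node i owns the
    # fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy()) for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] += payload

    # Phase 2 (allgather): in step s, node i forwards the completed chunk
    # (i + 1 - s) % n to its right neighbor. After n-1 more steps every node
    # holds every reduced chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy()) for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] = payload

    return [np.concatenate(c) for c in chunks]

grads = [np.full(6, rank, dtype=float) for rank in range(3)]  # fake per-node gradients
print(ring_allreduce(grads)[0])  # every node ends with the elementwise sum [3. 3. ...]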
Horovod
Horovod is a distributed deep learning plugin based on the ring-allreduce approach. It supports several popular frameworks, including TensorFlow, Keras, and PyTorch, so platform developers only need to integrate with Horovod once instead of maintaining a different setup for each framework.
Horovod with PyTorch
https://horovod.readthedocs.io/en/stable/pytorch.html
To use Horovod with PyTorch, make the following modifications to your training script:
- Run hvd.init().

- Pin each GPU to a single process. With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

  if torch.cuda.is_available():
      torch.cuda.set_device(hvd.local_rank())

- Scale the learning rate by the number of workers. Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.

- Wrap the optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

- Broadcast the initial variable states from rank 0 to all other processes:

  hvd.broadcast_parameters(model.state_dict(), root_rank=0)
  hvd.broadcast_optimizer_state(optimizer, root_rank=0)

  This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

- Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. Accomplish this by guarding the model checkpointing code with hvd.rank() != 0, as sketched below.
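For instance, a minimal guard might look like this (the filename is illustrative; the full training script later in this section does the same thing in its save_checkpoint function):

# Only worker 0 writes the checkpoint; other ranks skip the save entirely.
if hvd.rank() == 0:
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict()}, 'checkpoint.pth.tar')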
Example (also see a full training example):
import torch
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data.distributed
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())

# Define dataset...
train_dataset = ...

# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

# Build model...
model = ...
model.cuda()

# Horovod: scale the learning rate by the number of workers (base value illustrative).
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Broadcast parameters from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        # args.log_interval comes from your own argparse settings.
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
Training ResNet-50 on ImageNet
from __future__ import print_function

import torch
import argparse
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data.distributed
from torchvision import datasets, transforms, models
import horovod.torch as hvd
import os
import math
from tqdm import tqdm
from distutils.version import LooseVersion

# Training settings
parser = argparse.ArgumentParser(description='PyTorch ImageNet Example',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--train-dir', default=os.path.expanduser('~/imagenet/train'),
                    help='path to training data')
parser.add_argument('--val-dir', default=os.path.expanduser('~/imagenet/validation'),
                    help='path to validation data')
parser.add_argument('--log-dir', default='./logs',
                    help='tensorboard log directory')
parser.add_argument('--checkpoint-format', default='./checkpoint-{epoch}.pth.tar',
                    help='checkpoint file format')
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')
parser.add_argument('--batches-per-allreduce', type=int, default=1,
                    help='number of batches processed locally before '
                         'executing allreduce across workers; it multiplies '
                         'total batch size.')
parser.add_argument('--use-adasum', action='store_true', default=False,
                    help='use adasum algorithm to do reduction')

# Default settings from https://arxiv.org/abs/1706.02677.
parser.add_argument('--batch-size', type=int, default=32,
                    help='input batch size for training')
parser.add_argument('--val-batch-size', type=int, default=32,
                    help='input batch size for validation')
parser.add_argument('--epochs', type=int, default=90,
                    help='number of epochs to train')
parser.add_argument('--base-lr', type=float, default=0.0125,
                    help='learning rate for a single GPU')
parser.add_argument('--warmup-epochs', type=float, default=5,
                    help='number of warmup epochs')
parser.add_argument('--momentum', type=float, default=0.9,
                    help='SGD momentum')
parser.add_argument('--wd', type=float, default=0.00005,
                    help='weight decay')

parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--seed', type=int, default=42,
                    help='random seed')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

allreduce_batch_size = args.batch_size * args.batches_per_allreduce

hvd.init()
torch.manual_seed(args.seed)

if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)

cudnn.benchmark = True

# If set > 0, will resume training from a given checkpoint.
resume_from_epoch = 0
for try_epoch in range(args.epochs, 0, -1):
    if os.path.exists(args.checkpoint_format.format(epoch=try_epoch)):
        resume_from_epoch = try_epoch
        break

# Horovod: broadcast resume_from_epoch from rank 0 (which will have
# checkpoints) to other ranks.
resume_from_epoch = hvd.broadcast(torch.tensor(resume_from_epoch), root_rank=0,
                                  name='resume_from_epoch').item()

# Horovod: print logs on the first worker.
verbose = 1 if hvd.rank() == 0 else 0

# Horovod: write TensorBoard logs on first worker.
try:
    if LooseVersion(torch.__version__) >= LooseVersion('1.2.0'):
        from torch.utils.tensorboard import SummaryWriter
    else:
        from tensorboardX import SummaryWriter
    log_writer = SummaryWriter(args.log_dir) if hvd.rank() == 0 else None
except ImportError:
    log_writer = None

# Horovod: limit # of CPU threads to be used per worker.
torch.set_num_threads(4)

kwargs = {'num_workers': 4, 'pin_memory': True} if args.cuda else {}
train_dataset = \
    datasets.ImageFolder(args.train_dir,
                         transform=transforms.Compose([
                             transforms.RandomResizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
# Horovod: use DistributedSampler to partition data among workers. Manually specify
# `num_replicas=hvd.size()` and `rank=hvd.rank()`.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=allreduce_batch_size,
    sampler=train_sampler, **kwargs)

val_dataset = \
    datasets.ImageFolder(args.val_dir,
                         transform=transforms.Compose([
                             transforms.Resize(256),
                             transforms.CenterCrop(224),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
val_sampler = torch.utils.data.distributed.DistributedSampler(
    val_dataset, num_replicas=hvd.size(), rank=hvd.rank())
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=args.val_batch_size,
                                         sampler=val_sampler, **kwargs)


# Set up standard ResNet-50 model.
model = models.resnet50()

# By default, Adasum doesn't need scaling up learning rate.
# For sum/average with gradient accumulation: scale learning rate by batches_per_allreduce
lr_scaler = args.batches_per_allreduce * hvd.size() if not args.use_adasum else 1

if args.cuda:
    # Move model to GPU.
    model.cuda()
    # If using GPU Adasum allreduce, scale learning rate by local_size.
    if args.use_adasum and hvd.nccl_built():
        lr_scaler = args.batches_per_allreduce * hvd.local_size()

# Horovod: scale learning rate by the number of GPUs.
optimizer = optim.SGD(model.parameters(),
                      lr=(args.base_lr * lr_scaler),
                      momentum=args.momentum, weight_decay=args.wd)

# Horovod: (optional) compression algorithm.
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none

# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters(),
    compression=compression,
    backward_passes_per_step=args.batches_per_allreduce,
    op=hvd.Adasum if args.use_adasum else hvd.Average)

# Restore from a previous checkpoint, if initial_epoch is specified.
# Horovod: restore on the first worker which will broadcast weights to other workers.
if resume_from_epoch > 0 and hvd.rank() == 0:
    filepath = args.checkpoint_format.format(epoch=resume_from_epoch)
    checkpoint = torch.load(filepath)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])

# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)


def train(epoch):
    model.train()
    train_sampler.set_epoch(epoch)
    train_loss = Metric('train_loss')
    train_accuracy = Metric('train_accuracy')

    with tqdm(total=len(train_loader),
              desc='Train Epoch #{}'.format(epoch + 1),
              disable=not verbose) as t:
        for batch_idx, (data, target) in enumerate(train_loader):
            adjust_learning_rate(epoch, batch_idx)

            if args.cuda:
                data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            # Split data into sub-batches of size batch_size
            for i in range(0, len(data), args.batch_size):
                data_batch = data[i:i + args.batch_size]
                target_batch = target[i:i + args.batch_size]
                output = model(data_batch)
                train_accuracy.update(accuracy(output, target_batch))
                loss = F.cross_entropy(output, target_batch)
                train_loss.update(loss)
                # Average gradients among sub-batches
                loss.div_(math.ceil(float(len(data)) / args.batch_size))
                loss.backward()
            # Gradient is applied across all ranks
            optimizer.step()
            t.set_postfix({'loss': train_loss.avg.item(),
                           'accuracy': 100. * train_accuracy.avg.item()})
            t.update(1)

    if log_writer:
        log_writer.add_scalar('train/loss', train_loss.avg, epoch)
        log_writer.add_scalar('train/accuracy', train_accuracy.avg, epoch)


def validate(epoch):
    model.eval()
    val_loss = Metric('val_loss')
    val_accuracy = Metric('val_accuracy')

    with tqdm(total=len(val_loader),
              desc='Validate Epoch #{}'.format(epoch + 1),
              disable=not verbose) as t:
        with torch.no_grad():
            for data, target in val_loader:
                if args.cuda:
                    data, target = data.cuda(), target.cuda()
                output = model(data)

                val_loss.update(F.cross_entropy(output, target))
                val_accuracy.update(accuracy(output, target))
                t.set_postfix({'loss': val_loss.avg.item(),
                               'accuracy': 100. * val_accuracy.avg.item()})
                t.update(1)

    if log_writer:
        log_writer.add_scalar('val/loss', val_loss.avg, epoch)
        log_writer.add_scalar('val/accuracy', val_accuracy.avg, epoch)


# Horovod: using `lr = base_lr * hvd.size()` from the very beginning leads to worse final
# accuracy. Scale the learning rate `lr = base_lr` ---> `lr = base_lr * hvd.size()` during
# the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
# After the warmup reduce learning rate by 10 on the 30th, 60th and 80th epochs.
def adjust_learning_rate(epoch, batch_idx):
    if epoch < args.warmup_epochs:
        epoch += float(batch_idx + 1) / len(train_loader)
        lr_adj = 1. / hvd.size() * (epoch * (hvd.size() - 1) / args.warmup_epochs + 1)
    elif epoch < 30:
        lr_adj = 1.
    elif epoch < 60:
        lr_adj = 1e-1
    elif epoch < 80:
        lr_adj = 1e-2
    else:
        lr_adj = 1e-3
    for param_group in optimizer.param_groups:
        param_group['lr'] = args.base_lr * hvd.size() * args.batches_per_allreduce * lr_adj


def accuracy(output, target):
    # get the index of the max log-probability
    pred = output.max(1, keepdim=True)[1]
    return pred.eq(target.view_as(pred)).cpu().float().mean()


def save_checkpoint(epoch):
    if hvd.rank() == 0:
        filepath = args.checkpoint_format.format(epoch=epoch + 1)
        state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        torch.save(state, filepath)


# Horovod: average metrics from distributed training.
class Metric(object):
    def __init__(self, name):
        self.name = name
        self.sum = torch.tensor(0.)
        self.n = torch.tensor(0.)

    def update(self, val):
        self.sum += hvd.allreduce(val.detach().cpu(), name=self.name)
        self.n += 1

    @property
    def avg(self):
        return self.sum / self.n


for epoch in range(resume_from_epoch, args.epochs):
    train(epoch)
    validate(epoch)
    save_checkpoint(epoch)
Note: PyTorch GPU support requires NCCL 2.2 or later. It also works with NCCL 2.1.15 if you are not using RoCE or InfiniBand.
Start the training job and specify the number of workers on the command line as you normally would when using Horovod:
# run training with 4 GPUs on a single machine
$ horovodrun -np 4 python train.py
# run training with 8 GPUs on two machines (4 GPUs each)
$ horovodrun -np 8 -H hostname1:4,hostname2:4 python train.py