![v2-c47227853472f031a6a60d94bb582c48_1440w.jpg?source=172ae18b](http://img-01.proxy.5ce.com/view/image?&type=2&guid=76f5a732-c22f-eb11-8da9-e4434bdf6706&url=https://pic1.zhimg.com/v2-c47227853472f031a6a60d94bb582c48_1440w.jpg?source=172ae18b)
In the previous article, I described multi-GPU data parallelism in PyTorch, but that is not the optimal approach.
Previous article: 游凯超, "pytorch 多GPU数据并行化" (PyTorch multi-GPU data parallelism), zhuanlan.zhihu.com
According to the official documentation:
Multi-Process Single-GPU
This is the highly recommended way to use DistributedDataParallel, with multiple processes, each of which operates on a single GPU. This is currently the fastest approach to do data parallel training using PyTorch and applies to both single-node(multi-GPU) and multi-node data parallel training. It is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
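So the best approach is distributed training with multiple processes, each using a single GPU.
To explore this more clearly, I give an example here with multiple processes and multiple GPUs per process. In fact, the multiple GPUs inside each process behave just like DataParallel. So what new issues does distributed training introduce?
1. Weights are initialized once and broadcast to all nodes
Look at the sample code: each process starts with a different initial weight, but after the model is wrapped in DistributedDataParallel the weights become identical (they are synchronized with the master node, rank 0).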
import
The output is:
at rank 0, before init, the weight is 0.0
at rank 1, before init, the weight is 1.0
at rank 1, after init, the weight is 0.0
at rank 0, after init, the weight is 0.0
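The broadcast behaviour described above can be reproduced without GPUs. What follows is only a minimal sketch under stated assumptions, not the code from this article: it runs two CPU processes on one machine with the gloo backend, starts them with the `fork` start method (Unix only), and gives each rank an initial weight equal to its rank. The helper names `_free_port`, `_worker` and `run_demo` are made up for illustration.

```python
import multiprocessing as mp
import os
import socket

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def _free_port():
    # Hypothetical helper: ask the OS for an unused TCP port for the rendezvous.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def _worker(rank, world_size, port, queue):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Assumption: every rank deliberately starts from a different weight.
    model = nn.Linear(in_features=1, out_features=1, bias=False)
    with torch.no_grad():
        model.weight.fill_(float(rank))
    before = model.weight.item()

    # The DDP constructor broadcasts rank 0's parameters to all other ranks.
    ddp_model = DDP(model)
    after = ddp_model.module.weight.item()

    print(f"at rank {rank}, before init, the weight is {before}")
    print(f"at rank {rank}, after init, the weight is {after}")
    queue.put((rank, before, after))
    dist.destroy_process_group()


def run_demo(world_size=2):
    # Returns {rank: (weight_before, weight_after)} collected from the workers.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    port = _free_port()
    procs = [
        ctx.Process(target=_worker, args=(rank, world_size, port, queue))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    results = {}
    for _ in procs:
        rank, before, after = queue.get(timeout=120)
        results[rank] = (before, after)
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    run_demo()
```

The order of the printed lines depends on process scheduling, so it can differ from run to run; only the values per rank are determined.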
2. Gradients are averaged across processes, but within each process they are still summed over the batch dimension.
Here is an example:
import torch
import torch.nn as nn

# Rank and world size are read from the environment set up by the launcher.
torch.distributed.init_process_group(backend='gloo')
rank = torch.distributed.get_rank()
# Each process drives two GPUs: rank 0 uses GPUs 0 and 1, rank 1 uses GPUs 2 and 3.
device_ids = [2*rank, 2*rank+1]
output_device = torch.device(device_ids[0])
model = nn.Linear(in_features=1, out_features=1, bias=False).cuda(output_device)
# Set the single weight to 1 on every process.
model.weight.data.zero_()
model.weight.data.add_(1)
model = torch.nn.parallel.DistributedDataParallel(model,device_ids=device_ids, output_device=output_device)
# Rank 0 sees the inputs [0, 1]; rank 1 sees the inputs [2, 3].
x = torch.tensor([2*rank, 2*rank + 1], dtype=torch.float).cuda(output_device).view(-1, 1)
y = model(x)
label = torch.zeros(2, 1, dtype=torch.float).cuda(output_device)
# Sum (not mean) of the squared errors over the batch.
loss = torch.sum((y - label)**2)
loss.backward()
print(model.module.weight.grad.item())
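Following the derivation in the previous article, the gradient summed over the batch is 2 on rank 0 and 26 on rank 1, and their average across the two processes is 14.
The program prints: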
14.0
14.0
This confirms our theoretical calculation.
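The arithmetic behind the value 14.0 can be checked by hand in plain Python. This is only a sketch of the calculation, not of how DistributedDataParallel is implemented: it assumes the same setup as the example (weight 1, labels 0, loss equal to the sum of squared errors), and the function names `per_sample_grad`, `process_grad` and `ddp_grad` are made up for illustration.

```python
def per_sample_grad(w, x, label=0.0):
    # d/dw (w*x - label)^2 = 2 * (w*x - label) * x
    return 2.0 * (w * x - label) * x


def process_grad(w, xs):
    # Within one process the loss is a sum over the batch,
    # so the per-sample gradients are summed.
    return sum(per_sample_grad(w, x) for x in xs)


def ddp_grad(w, batches):
    # Across processes the per-process gradients are averaged.
    grads = [process_grad(w, xs) for xs in batches]
    return sum(grads) / len(grads)


# Rank 0 holds x = [0, 1]; rank 1 holds x = [2, 3]; the weight is 1 everywhere.
batches = [[0.0, 1.0], [2.0, 3.0]]
print([process_grad(1.0, xs) for xs in batches])  # [2.0, 26.0]
print(ddp_grad(1.0, batches))  # 14.0
```

If the gradients were summed across processes instead of averaged, the result would be 28 rather than 14, which is how the printed value distinguishes the two behaviours.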
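A quiet aside: in my experience NCCL kept running into problems and was less stable than gloo.
At this point, the overall picture of distributed training and multi-GPU training in PyTorch is fairly clear.
I have not yet looked at training across multiple machines. Still, being able to use multiple GPUs on a single machine should cover most of our everyday needs.
That's all for now.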