PyTorch Multi-GPU Training

来啊，battle啊

Tags: Pytorch, distributed training

Article link: https://blog.csdn.net/weixin_43738628/article/details/102457977

A Few Notes on Multi-GPU Training in PyTorch

Background

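In the era of big data, single-machine, single-GPU training can no longer keep up with growing model sizes and data volumes, so training on multiple GPUs has become mainstream. PyTorch has provided multi-GPU interfaces since its early (0.x) releases; this article briefly introduces two ways to train on multiple GPUs in PyTorch.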

About Multi-GPU

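Multi-GPU literally means that the machine has two or more GPUs. On a machine with CUDA installed, the nvidia-smi command shows the number of GPUs along with other information such as memory usage.
[Figure: nvidia-smi output]
That is the single-machine, multi-GPU case. In the other case the GPUs are spread across several machines. PyTorch provides training interfaces for both cases, which are introduced in turn below.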
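
For a scripted check, the same information can be read from a program. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper name list_gpus is made up for illustration. It calls nvidia-smi -L, which prints one line per GPU, and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def list_gpus():
    """Return one description line per GPU reported by `nvidia-smi -L`.

    Hypothetical helper: returns an empty list when nvidia-smi is not
    installed or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    print(f"Found {len(gpus)} GPU(s)")
    for line in gpus:
        print(" ", line)
```

Inside a PyTorch program, torch.cuda.device_count() reports how many of these GPUs the framework can actually see.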

torch.nn.DataParallel()

This method is very simple to use: just wrap your model in a wrapper.

import torch.nn as nn
model = nn.DataParallel(model)

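According to the official PyTorch documentation, this method replicates the model onto every GPU for the forward pass and splits the input batch along the batch dimension into n chunks, where n is the number of GPUs in use. Each replica runs forward and backward on its own chunk; the outputs are gathered on one GPU (GPU 0 by default), and the gradients are accumulated into the original model on that GPU. Because the batch size you set is the total that gets divided among the GPUs, multiply it by the number of GPUs if you want to keep the per-GPU batch size unchanged. In some cases, for example when the amount of data is small, multiple GPUs can even be slower than a single one. A runnable example follows.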

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

input_size = 5
output_size = 2
batch_size = 30
data_size = 30

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):
    # Dataset of `length` random vectors with `size` features each

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model: a single linear layer

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        # Printed once per replica, so the per-GPU chunk size is visible
        print("  In Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # This one line is all it takes
    model = nn.DataParallel(model)

# The parameters must live on the first GPU before the forward pass
model.to(device)

for data in rand_loader:
    # Variable has been unnecessary since PyTorch 0.4; plain tensors work
    input_var = data.to(device)
    output = model(input_var)
    print("Outside: input size", input_var.size(), "output_size", output.size())
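
The batch splitting described above can be worked out by hand. The sketch below is a pure-Python illustration, assuming the split follows torch.chunk semantics along dimension 0 (my understanding of how DataParallel scatters its inputs); the helper name split_batch is hypothetical.

```python
import math


def split_batch(batch_size, num_gpus):
    """Per-replica batch sizes when a batch is chunked along dimension 0.

    Hypothetical helper mirroring torch.chunk semantics: every chunk has
    ceil(batch_size / num_gpus) rows except possibly the last, and fewer
    than num_gpus chunks may come back when the batch is small.
    """
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "GPU(s):", split_batch(30, n))
```

With batch_size = 30 as in the example, two GPUs each see 15 rows, which is the size the "In Model" print would report on each replica.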

torch.nn.parallel.DistributedDataParallel

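This is the approach the official documentation recommends. It was designed for distributed training but also works on a single machine, and it performs better than the previous method. The official documentation describes its advantages as follows: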

Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components.
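
The first point can be illustrated without any GPU. The following is a toy simulation in plain Python, not PyTorch code: two simulated processes start from the same weight, compute gradients on different data shards, average them (the role all-reduce plays), and apply the update locally. All function names are made up for the sketch.

```python
def local_gradient(weight, shard):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over one data shard."""
    return sum((weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every process receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_step(weights, shards, lr=0.1):
    """One synchronous step: local backward, averaged gradients, local update."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]


if __name__ == "__main__":
    # Two simulated processes, identical initial weight, different shards
    # drawn from the line y = 2x.
    weights = [0.0, 0.0]
    shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
    for _ in range(5):
        weights = train_step(weights, shards)
    print(weights)
```

Because every process applies the same averaged gradient to the same starting weight, the copies stay identical after each step, so the weights never need to be broadcast.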

Here is an example that can also be run directly.

import os

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# 1) Initialize the process group (the launcher supplies rank and world size)
torch.distributed.init_process_group(backend="nccl")

input_size = 5
output_size = 2
batch_size = 30
data_size = 90

# 2) Bind each process to its own GPU.
# get_rank() is the global rank, which equals the local rank only on a single
# machine; when the launcher exports LOCAL_RANK, that value is used instead.
local_rank = int(os.environ.get("LOCAL_RANK", torch.distributed.get_rank()))
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        # Keep the data on the CPU; batches are moved to the GPU in the loop
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

dataset = RandomDataset(input_size, data_size)
# 3) Use DistributedSampler so each process gets its own shard of the data
rand_loader = DataLoader(dataset=dataset,
                         batch_size=batch_size,
                         sampler=DistributedSampler(dataset))

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("  In Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)

# 4) Move the model to this process's GPU before wrapping it
model.to(device)

print("Let's use", torch.distributed.get_world_size(), "GPUs!")
# 5) Wrap the model; every process must do this so gradients stay in sync
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[local_rank],
                                                  output_device=local_rank)

for data in rand_loader:
    # Variable has been unnecessary since PyTorch 0.4; plain tensors work
    input_var = data.to(device)
    output = model(input_var)
    print("Outside: input size", input_var.size(), "output_size", output.size())
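
To see what step 3 does to the data, the sketch below reproduces in plain Python the strided, padded partition that DistributedSampler uses, under the assumption that shuffle=False and drop_last=False. With the default shuffle=True the index list is permuted first, and calling the sampler's set_epoch(epoch) at the start of each epoch changes that permutation. The helper names are hypothetical.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one process would see under a strided, padded partition.

    Sketch of the scheme assumed above: pad the index list so it divides
    evenly, then give rank r every world_size-th index starting at r.
    """
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # repeat leading samples as padding
    return indices[rank:total:world_size]


def batch_sizes(num_samples, batch_size):
    """Batch sizes a DataLoader yields with drop_last=False."""
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


if __name__ == "__main__":
    for rank in range(2):
        idx = shard_indices(dataset_len=90, world_size=2, rank=rank)
        print(f"rank {rank}: {len(idx)} samples, batches {batch_sizes(len(idx), 30)}")
```

For the example's 90 samples on two processes, each process gets 45 samples and therefore one batch of 30 and one of 15.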

Running the program from the command line

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 torch_ddp.py
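
The launcher starts one process per GPU and tells each process who it is through environment variables such as RANK and WORLD_SIZE, which init_process_group reads by default. The sketch below shows one way a script might collect them; the helper name read_launch_env is made up, and falling back from LOCAL_RANK to RANK is an assumption that only holds on a single machine.

```python
import os


def read_launch_env(environ=os.environ):
    """Collect rank information exported by the launcher.

    Defaults describe a plain single-process run. LOCAL_RANK is only set by
    some launcher versions, so it falls back to RANK here (an assumption
    that is only valid on a single machine).
    """
    rank = int(environ.get("RANK", 0))
    return {
        "rank": rank,
        "local_rank": int(environ.get("LOCAL_RANK", rank)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    print(read_launch_env())
```

Newer PyTorch releases also ship torchrun, which accepts the same --nproc_per_node option and always exports LOCAL_RANK.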
