PyTorch Multi-Node, Multi-GPU Distributed Training

原@DeepGlint


Article link: https://blog.csdn.net/weixin_44019216/article/details/106171189

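This article describes how to run single-node multi-GPU and multi-node multi-GPU distributed training in PyTorch, covering the use of DataParallel and the initialization of DistributedDataParallel. With multiple GPUs, PyTorch automatically splits each input batch, dispatches the pieces to the individual GPUs for computation, and then merges the results. For the multi-node case, DistributedDataParallel lets every process keep its own optimizer and run its own optimization step, which reduces the time spent transferring tensors. Each process also has its own Python interpreter, which avoids contention on the GIL.

(Summary generated automatically by CSDN.)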

1. Single-node multi-GPU: using DataParallel
2. Multi-node multi-GPU: using torch.nn.parallel.DistributedDataParallel()
   - Initialization
     - TCP initialization 1 -- rank 0 network address (most useful I think)
     - TCP initialization 2 -- Shared file-system initialization
3. Python's GIL, multithreading, and multiprocessing

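Note: with DataParallel, the batch size should be set to n times the single-GPU batch size (n being the number of GPUs), because each batch is split across the GPUs. With DistributedDataParallel, the batch size is set per process, so it can stay the same as in the single-GPU case.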
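To make the batch-size note concrete, here is a small pure-Python sketch of the arithmetic. It assumes DataParallel splits a batch along dimension 0 the way torch.chunk does (equal chunks of size ceil(batch / n), with a smaller last chunk); the helper names `dataparallel_chunks` and `ddp_global_batch` are made up for this illustration and are not PyTorch APIs.

```python
import math


def dataparallel_chunks(batch_size, num_gpus):
    """Per-GPU chunk sizes, assuming a torch.chunk-style split along dim 0."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes


def ddp_global_batch(per_process_batch, world_size):
    """Samples consumed per step when every DDP process loads its own batch."""
    return per_process_batch * world_size


# DataParallel: one loader batch of 30 is divided among the GPUs.
dp_two_gpus = dataparallel_chunks(30, 2)
dp_eight_gpus = dataparallel_chunks(30, 8)
print(dp_two_gpus)
print(dp_eight_gpus)

# DistributedDataParallel: each of 2 processes loads its own batch of 30.
ddp_total = ddp_global_batch(30, 2)
print(ddp_total)
```

Under this assumption, keeping the per-GPU workload constant with DataParallel means multiplying the loader's batch size by the number of GPUs, while with DistributedDataParallel the per-process batch size already is the per-GPU batch size.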

1. Single-node multi-GPU: using DataParallel

Imports and parameters

Import PyTorch modules and define parameters.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

Device

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Dummy DataSet

Make a dummy (random) dataset. You just need to implement the __getitem__ method (plus __len__, so that the DataLoader knows the dataset size).

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)
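As a quick check of what this loader yields, the sketch below rebuilds the same dummy dataset with the parameters used above (input size 5, 100 samples, batch size 30) and collects the shape of every batch. The variable names `loader` and `batch_shapes` are only for this illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    """Dummy dataset: `length` random vectors of dimension `size`."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


# 100 samples with batch_size=30 and the default drop_last=False.
loader = DataLoader(dataset=RandomDataset(5, 100), batch_size=30, shuffle=True)

# Shuffling changes which samples land in a batch, not the batch shapes.
batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)
```

The last batch holds only the 10 leftover samples, which is worth remembering when reading the per-GPU sizes printed by the model later on.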

Simple Model

For the demo, our model just gets an input, performs a linear operation, and gives an output. However, you can use DataParallel on any model (CNN, RNN, Capsule Net etc.)

We’ve placed a print statement inside the model to monitor the size of input and output tensors. Please pay attention to what is printed at batch rank 0.

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super().__init__()
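A minimal sketch of the model described above and of wrapping it in nn.DataParallel. It assumes the model is a single nn.Linear layer; the attribute name `fc`, the argument name `x`, and the exact print format are assumptions for illustration, not necessarily the author's code.

```python
import torch
import torch.nn as nn


class Model(nn.Module):
    # One linear layer, with a print to show the sizes each replica sees.

    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)  # `fc` is an assumed name

    def forward(self, x):
        output = self.fc(x)
        print("\tIn Model: input size", x.size(),
              "output size", output.size())
        return output


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Model(5, 2)
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module and splits each batch along dim 0.
    model = nn.DataParallel(model)
model.to(device)

batch = torch.randn(30, 5).to(device)
output = model(batch)
print("Outside: input size", batch.size(), "output size", output.size())
```

On a machine with zero or one GPU the model sees the whole batch of 30. With several GPUs, each replica prints a smaller input size, while the gathered output outside the model still has 30 rows.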
