PyTorch: fixing unbalanced GPU memory usage (GPU 0 uses more than the other cards) in multi-GPU training

1. Assign a different batch_size to each GPU

This post addresses the problem that, during multi-GPU training in PyTorch, GPU 0 uses noticeably more memory than the other cards.
Why GPU 0 uses more memory: during the backward pass, the gradients of the loss are computed on GPU 0 by default, so it consumes more memory than the other cards; how much more depends on the network structure.
To keep training from being interrupted by an out-of-memory error, the crude fix is to reduce the batch_size.
Is there a more elegant way? Yes: borrow the BalancedDataParallel class used in transformer-xl, which lets you assign a different batch_size to each GPU:

import torch
from torch.nn.parallel.data_parallel import DataParallel
from torch.nn.parallel.parallel_apply import parallel_apply
from torch.nn.parallel._functions import Scatter


def scatter(inputs, target_gpus, chunk_sizes, dim=0):
    r"""
    Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """

    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            try:
                return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
            except Exception:
                print('obj', obj.size())
                print('dim', dim)
                print('chunk_sizes', chunk_sizes)
                quit()
        if isinstance(obj, tuple) and len(obj) > 0:
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            return list(map(list, zip(*map(scatter_map, obj))))
        if isinstance(obj, dict) and len(obj) > 0:
            return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        return scatter_map(inputs)
    finally:
        scatter_map = None


def scatter_kwargs(inputs, kwargs, target_gpus, chunk_sizes, dim=0):
    """Scatter with support for kwargs dictionary"""
    inputs = scatter(inputs, target_gpus, chunk_sizes, dim) if inputs else []
    kwargs = scatter(kwargs, target_gpus, chunk_sizes, dim) if kwargs else []
    if len(inputs) < len(kwargs):
        inputs.extend([() for _ in range(len(kwargs) - len(inputs))])
    elif len(kwargs) < len(inputs):
        kwargs.extend([{} for _ in range(len(inputs) - len(kwargs))])
    inputs = tuple(inputs)
    kwargs = tuple(kwargs)
    return inputs, kwargs


class BalancedDataParallel(DataParallel):
    """DataParallel variant that lets GPU 0 take a smaller share of the batch.

    gpu0_bsz is the number of samples assigned to GPU 0; the remaining samples
    are split evenly across the other devices.
    """

    def __init__(self, gpu0_bsz, *args, **kwargs):
        self.gpu0_bsz = gpu0_bsz
        super().__init__(*args, **kwargs)

    def forward(self, *inputs, **kwargs):
        if not self.device_ids:
            return self.module(*inputs, **kwargs)
        if self.gpu0_bsz == 0:
            device_ids = self.device_ids[1:]
        else:
            device_ids = self.device_ids
        inputs, kwargs = self.scatter(inputs, kwargs, device_ids)
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])
        replicas = self.replicate(self.module, self.device_ids)
        if self.gpu0_bsz == 0:
            replicas = replicas[1:]
        outputs = self.parallel_apply(replicas, device_ids, inputs, kwargs)
        return self.gather(outputs, self.output_device)

    def parallel_apply(self, replicas, device_ids, inputs, kwargs):
        return parallel_apply(replicas, inputs, kwargs, device_ids)

    def scatter(self, inputs, kwargs, device_ids):
        bsz = inputs[0].size(self.dim)
        num_dev = len(self.device_ids)
        gpu0_bsz = self.gpu0_bsz
        # Split the samples not assigned to GPU 0 evenly across the other devices.
        bsz_unit = (bsz - gpu0_bsz) // (num_dev - 1)
        if gpu0_bsz < bsz_unit:
            chunk_sizes = [gpu0_bsz] + [bsz_unit] * (num_dev - 1)
            # Distribute any leftover samples among the other GPUs, one each.
            delta = bsz - sum(chunk_sizes)
            for i in range(delta):
                chunk_sizes[i + 1] += 1
            if gpu0_bsz == 0:
                chunk_sizes = chunk_sizes[1:]
        else:
            # gpu0_bsz is not smaller than an even split, so fall back to
            # DataParallel's default (equal) scatter.
            return super().scatter(inputs, kwargs, device_ids)
        return scatter_kwargs(inputs, kwargs, device_ids, chunk_sizes, dim=self.dim)

   

As the code shows, BalancedDataParallel inherits from torch.nn.DataParallel and adds a user-defined batch size for GPU 0, gpu0_bsz, so that GPU 0 receives fewer samples and its memory usage is balanced against the other cards. It is called as follows:

# Assumes the class above is saved in a local file, e.g. balanced_data_parallel.py
from balanced_data_parallel import BalancedDataParallel

device_ids = [0, 1]

if n_gpu > 1:
    # args.batchsize_0 is the batch size to place on GPU 0 (gpu0_bsz)
    model = BalancedDataParallel(args.batchsize_0, model, dim=0)

model = model.cuda()

gpu0_bsz: the batch_size placed on GPU 0;
model: the model to wrap;
dim: the dimension along which the batch is split (normally 0).
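
To make the split concrete, here is a minimal sketch that reproduces the chunk-size arithmetic from the scatter method above, using made-up numbers (a total batch of 32 over 3 GPUs, with 6 samples reserved for GPU 0):

# Reproduce BalancedDataParallel.scatter's chunk-size computation for a
# hypothetical setup: total batch 32, 3 GPUs, 6 samples on GPU 0.
bsz, num_dev, gpu0_bsz = 32, 3, 6

bsz_unit = (bsz - gpu0_bsz) // (num_dev - 1)            # (32 - 6) // 2 = 13
chunk_sizes = [gpu0_bsz] + [bsz_unit] * (num_dev - 1)   # [6, 13, 13]
delta = bsz - sum(chunk_sizes)                          # 0 leftover samples here
for i in range(delta):
    chunk_sizes[i + 1] += 1                             # spread leftovers over GPUs 1..N-1

print(chunk_sizes)   # [6, 13, 13]: GPU 0 takes 6 samples, the other two take 13 each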

2. Use torch.nn.DataParallel and let it distribute the batch itself

During training, after defining the model:

model = torch.nn.DataParallel(model, device_ids=device_ids)
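
For reference, a minimal end-to-end sketch of the standard DataParallel setup (the model, sizes, and device list below are placeholders for illustration):

import torch
import torch.nn as nn

device_ids = [0, 1]                         # GPUs to use; adjust to your machine
device = torch.device("cuda:{}".format(device_ids[0]))

model = nn.Linear(128, 10)                  # placeholder model
model = nn.DataParallel(model, device_ids=device_ids)
model = model.to(device)                    # parameters live on the first listed device

x = torch.randn(64, 128, device=device)     # the batch is split across device_ids
out = model(x)                              # outputs are gathered back on device_ids[0]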

If the model was trained on multiple GPUs wrapped in torch.nn.DataParallel, you cannot load the checkpoint directly for CPU inference: every key in the saved state_dict has been prefixed with "module.". To load it on the CPU you must both pass map_location="cpu" when loading across devices and rewrite the state_dict keys to strip the "module." prefix. Once the keys are fixed, CPU inference works normally.
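
As an illustration (the file name and layer names here are made up), a DataParallel checkpoint contains keys like "module.fc.weight"; a compact way to strip the prefix and load on CPU is:

import torch

# Hypothetical checkpoint; its keys look like "module.fc.weight", "module.fc.bias", ...
state_dict = torch.load("checkpoint.pth", map_location="cpu")

# Remove the "module." prefix that nn.DataParallel added to every parameter name.
cleaned = {k.replace("module.", "", 1): v for k, v in state_dict.items()}

model.load_state_dict(cleaned)   # model is the plain, unwrapped module on the CPU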

If the model was trained on multiple GPUs with torch.nn.DataParallel and you then test on GPU, PyTorch may raise: Missing key(s) in state_dict / Unexpected key(s) in state_dict.

Solution:

Add the following lines before the parameters are loaded into the model, and pass an extra False (i.e. strict=False) to load_state_dict:

model = model.to(device)
model = nn.DataParallel(model)

# pth is the state_dict loaded from the checkpoint, e.g. torch.load(args.checkpoint)
model.load_state_dict(pth, strict=False)
 

def change_feature(check_point):
    """Strip the leading 'module.' that nn.DataParallel adds to every key."""
    import collections
    dicts = collections.OrderedDict()

    for k, value in check_point.items():
        # print("names:{}".format(k))             # inspect parameter names
        # print("shape:{}".format(value.size()))  # inspect parameter shapes
        if "module" in k:                         # key came from a DataParallel model
            k = ".".join(k.split(".")[1:])        # drop the leading "module"
            dicts[k] = value
        else:                                     # key already has no prefix: keep as-is
            dicts[k] = value

    return dicts

if args.cuda:
    model = model.cuda()
    model = torch.nn.DataParallel(model)
    model.load_state_dict(torch.load(args.checkpoint), strict=False)
else:
    check_point = torch.load(args.checkpoint, map_location=torch.device('cpu'))
    # If the checkpoint was saved from a multi-GPU (DataParallel) run,
    # strip the "module." prefix before loading:
    state_dict = change_feature(check_point)
    model.load_state_dict(state_dict)
    # If it was saved from a single-GPU run, load it directly instead:
    # model.load_state_dict(check_point)
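
A related tip, not in the original post but standard PyTorch practice: the "module." prefix can be avoided from the start by saving the unwrapped module rather than the DataParallel wrapper:

# model is wrapped in nn.DataParallel; model.module is the underlying network.
# Saving model.module.state_dict() produces keys without the "module." prefix,
# so the checkpoint loads directly on CPU or a single GPU.
torch.save(model.module.state_dict(), "checkpoint.pth")   # hypothetical path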


 


Original article: https://blog.csdn.net/weixin_43922901/article/details/106117774

Original article: https://blog.csdn.net/qq_39852676/article/details/106928329
