Multi-GPU training in PyTorch


猫猫与橙子


Article link: https://blog.csdn.net/qq_22764813/article/details/91410748


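1. Out of memory during multi-GPU training

Scenario: the pretrained model was trained on GPU 0, and I wanted to fine-tune it on a multi-GPU server using GPU ids [4,5,6,7]. The following error appeared: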

cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25

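Analysis: the error is raised at the point where the pretrained model is loaded. On a desktop machine with a single GPU, I loaded the model and printed its parameters:

Code:

import torch

checkpoint = torch.load("/home/final.pth")
for k, v in checkpoint.items():
    print(k)  # parameter name
    print(v)  # parameter tensor; for CUDA tensors the repr includes the device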

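The printed output showed that the parameters are placed on the GPU as soon as the checkpoint is loaded. The error on the server therefore most likely came from the parameters being loaded straight onto the GPU with id = 0, whose memory was already full, hence the out-of-memory error. However, when I tried it, using the GPU with id = 4 directly also produced an error. The model-loading code that failed was:

if conf.pretrained:
    # Only tensors saved from cuda:1 are remapped; the checkpoint was saved
    # from cuda:0, so presumably nothing matches and it still lands on cuda:0.
    checkpoint = torch.load("./model_mobilefacenet.pth",
                            map_location={'cuda:1': 'cuda:0'})
    self.model_mobile.load_state_dict(checkpoint)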

Changing it to the following also failed:

if conf.pretrained:
    # This still moves every tensor onto GPU 0.
    checkpoint = torch.load("./model_mobilefacenet.pth",
                            map_location=lambda storage, loc: storage.cuda(0))
    self.model_mobile.load_state_dict(checkpoint)

My multi-GPU training setup at that point was:

# Main device: the first GPU used for training (cuda:4 here)
conf.device = torch.device("cuda:4" if torch.cuda.is_available() else "cpu")

if torch.cuda.device_count() > 1:
    self.model = torch.nn.DataParallel(self.model, device_ids=[4, 5, 6, 7])
    self.model.to(conf.device)
    self.model_mobile = torch.nn.DataParallel(self.model_mobile, device_ids=[4, 5, 6, 7])
    self.model_mobile.to(conf.device)

The version that finally ran without errors:

self.model_mobile = MobileFaceNet(conf.embedding_size)
if conf.pretrained:
    checkpoint = torch.load("./model_mobilefacenet.pth")
    self.model_mobile.load_state_dict(checkpoint)

# No explicit device_ids: DataParallel uses every visible GPU.
if torch.cuda.device_count() > 1:
    self.model = torch.nn.DataParallel(self.model)
    self.model_mobile = torch.nn.DataParallel(self.model_mobile)

For training, launch the script with the following command line:

CUDA_VISIBLE_DEVICES="4,5,6,7" python train.py

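Also add the following at the top of the training script, before any CUDA call:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

With this, training runs on multiple GPUs without the error.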
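As a rough illustration of what this setting does: inside the process, the visible GPUs are renumbered from zero, so "cuda:0" refers to the first id in the list. The sketch below only computes that renumbering. `logical_to_physical` is a hypothetical helper written for this note, not a PyTorch or CUDA function, and it assumes the variable holds plain numeric ids.

```python
import os


def logical_to_physical(visible_devices):
    """Map in-process device names to the physical GPU ids they refer to.

    Assumes a comma-separated list of numeric ids such as "4,5,6,7".
    """
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return {f"cuda:{index}": gpu_id for index, gpu_id in enumerate(ids)}


# Same value as in the training script above.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
mapping = logical_to_physical(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)
```

Under this setting a checkpoint saved from cuda:0 is restored onto physical GPU 4, not physical GPU 0.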

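That said, I did not fully understand it at first: before I started loading the pretrained model, the earlier way of setting up multi-GPU training worked, but afterwards it no longer did. If anyone knows more, please let me know.

Cause: the parameters in the pretrained checkpoint carry the id of the GPU they were saved from, and that id differs from the GPUs I train on, so once the pretrained model is loaded the earlier approach no longer works. With CUDA_VISIBLE_DEVICES="4,5,6,7", physical GPU 4 is renumbered as cuda:0 inside the process, which is presumably why the checkpoint's cuda:0 tensors then land on a GPU that has free memory.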
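A commonly used alternative, which is not the fix described above, is to pass map_location="cpu" to torch.load so that no tensor is restored onto the GPU it was saved from, and to move the model to the training device afterwards. The sketch below is only an illustration: the small nn.Linear model and the in-memory buffer are stand-ins for the real network and checkpoint file, and `load_state_dict_on_cpu` is a hypothetical helper name.

```python
import io

import torch
import torch.nn as nn


def load_state_dict_on_cpu(source):
    """Load a checkpoint so that every tensor is placed on the CPU."""
    return torch.load(source, map_location="cpu")


# Stand-in for the pretrained checkpoint: save a tiny model to memory.
saved_model = nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(saved_model.state_dict(), buffer)
buffer.seek(0)

# Load on the CPU first, regardless of where the checkpoint was saved.
state = load_state_dict_on_cpu(buffer)
restored = nn.Linear(4, 2)
restored.load_state_dict(state)

# Only now move the model to the device used for training.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
restored.to(device)

for name, tensor in state.items():
    print(name, tensor.device)
```

Loading on the CPU first keeps the load step independent of which GPUs happen to be free on the target machine.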

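2. Unbalanced GPU load in PyTorch multi-GPU training

The first GPU usually carries more of the load, while the other GPUs are used less. When the batch size is increased, training stops because the first GPU runs out of memory, even though the other GPUs may be using less than half of theirs.

Cause: the loss for all GPUs is gathered and computed on the main GPU (the first one).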

References:

https://blog.csdn.net/weixin_40087578/article/details/87186613

https://discuss.pytorch.org/t/dataparallel-imbalanced-memory-usage/22551/22

https://www.pytorchtutorial.com/pytorch-large-batches-multi-gpu-and-distributed-training/

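Solution: move the forward computation of the loss function into the forward pass of self.model. Each GPU then computes its own loss, instead of everything being gathered onto a single GPU.

Before the change, the four GPUs used 9G, 4G, 4G and 4G.

After the change, the four GPUs used 10G, 9G, 9G and 9G.

There is one drawback: it is no longer possible to evaluate on the validation set during training, because the model now contains the loss function as well as the image feature extraction.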
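The idea of computing the loss inside the model can be sketched as a small wrapper module. This is a minimal sketch under assumptions, not the original training code: `ModelWithLoss`, the nn.Linear backbone and the random data are hypothetical stand-ins. The optional `targets` argument is also an addition of this sketch; it suggests one possible way to still get raw outputs for validation from the same wrapped model.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Wrap a backbone and a criterion so the loss is computed in forward()."""

    def __init__(self, backbone, criterion):
        super().__init__()
        self.backbone = backbone
        self.criterion = criterion

    def forward(self, inputs, targets=None):
        outputs = self.backbone(inputs)
        if targets is None:
            # Validation path: return the raw outputs, no loss.
            return outputs
        # Training path: each replica computes the loss on its own chunk.
        return self.criterion(outputs, targets)


backbone = nn.Linear(8, 3)  # hypothetical stand-in for the real network
wrapped = ModelWithLoss(backbone, nn.CrossEntropyLoss())
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped).cuda()

inputs = torch.randn(16, 8)
targets = torch.randint(0, 3, (16,))

# Under DataParallel one loss per GPU comes back, so average them.
loss = wrapped(inputs, targets).mean()
loss.backward()

with torch.no_grad():
    logits = wrapped(inputs)  # same model, called without targets
```

Averaging the per-GPU losses matches the single-GPU loss only when every GPU receives the same number of samples, which is worth keeping in mind for the last, possibly smaller, batch.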
