Machine setup:
- Workstation: Dell Precision 5820
- GPUs: one 2080 Ti, one 3090
I just got the 3090; before that I only used the 2080 Ti. Both cards are used purely for running deep-learning code. I'll keep updating this post as I learn more — corrections and tips are welcome!
Upgrading CUDA
- The 3090 (Ampere) requires CUDA 11 or later; I had been on CUDA 10.2 all along, so an upgrade was necessary.
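After upgrading, a quick sanity check like the one below can confirm that the installed PyTorch build actually supports the new card (a sketch, assuming a reasonably recent PyTorch; it also runs on CPU-only machines):

```python
import torch

# CUDA version this PyTorch build was compiled against (needs to be 11.x+ for a 3090)
print("compiled CUDA version:", torch.version.cuda)

# Compute architectures baked into this build; a 3090 needs sm_86 (or compatible PTX)
print("supported archs:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        # Name and compute capability of each visible GPU
        print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
```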
Warning: "imbalance between your GPUs."
- The following warning appears:
There is an imbalance between your GPUs. You may want to exclude GPU 0 which has less than 75% of the memory or cores of GPU 1. You can do so by setting the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES environment variable.
- You can refer to this link ("pytorch, setting up multiple GPUs") and use the line below. In my case, though, the warning went away after I switched to a different script...
net = nn.DataParallel(model.cuda(), device_ids=[0, 1])
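A slightly fuller sketch of that DataParallel setup — the model and input here are stand-ins, and the two-GPU branch assumes device ids 0 and 1 are visible (it falls back gracefully on fewer devices):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for your real model
if torch.cuda.device_count() > 1:
    # Replicate the model on GPUs 0 and 1; each input batch is scattered across them
    model = nn.DataParallel(model.cuda(), device_ids=[0, 1])
elif torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(32, 128)
if torch.cuda.is_available():
    x = x.cuda()

out = model(x)  # outputs are gathered back onto the first device
print(out.shape)
```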
Setting batch_size
- With multiple GPUs, the batch_size passed to the DataLoader is the total batch, which nn.DataParallel splits across the cards. So when setting hyperparameters, just use your usual batch_size; don't multiply it by the number of GPUs (especially with mismatched cards — the smaller one sets the limit), or you may hit:
RuntimeError: CUDA out of memory. Tried to allocate 1.30 GiB (GPU 1; 10.76 GiB total capacity; 6.51 GiB already allocated; 1.18 GiB free; 8.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
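To make the splitting concrete: the batch_size given to the DataLoader is the total batch; nn.DataParallel then divides each such batch among the visible GPUs. This small demo (runs on CPU, so the split itself isn't exercised) shows where that number lives:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 4))
loader = DataLoader(dataset, batch_size=8)

(batch,) = next(iter(loader))
# 8 is the TOTAL batch size; with nn.DataParallel on 2 GPUs, each card gets 4 samples
print(batch.shape)
```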
Warning: Detected call of lr_scheduler.step() before optimizer.step()
- The following warning appears:
UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
- As this link suggests, just swap the order so that optimizer.step() is called before scheduler.step().
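A minimal sketch of the corrected call order (the parameter and loss here are toys, just to show the ordering inside a training loop):

```python
import torch

param = torch.nn.Parameter(torch.zeros(3))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(2):
    optimizer.zero_grad()
    loss = (param - 1.0).pow(2).sum()
    loss.backward()
    optimizer.step()   # update the weights first...
    scheduler.step()   # ...then advance the learning-rate schedule
    print(epoch, optimizer.param_groups[0]["lr"])
```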
GPU temperature
- 65-75 °C is a good range; you can check temperatures with the nvidia-smi command.
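The same reading can be pulled programmatically via nvidia-smi's query flags, which is handy for logging during long runs (a sketch; it returns an empty list on machines without the tool):

```python
import shutil
import subprocess

def gpu_temperatures():
    """Return a list of GPU temperatures in deg C, or [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One temperature per line, one line per GPU
    return [int(line) for line in out.split() if line]

print(gpu_temperatures())
```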
Pinning a specific GPU
- Following this link, put the lines below at the top of the file:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = 'gpu_id'
where gpu_id is the GPU's id, which you can look up by running nvidia-smi in a terminal.
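For example, to expose only physical GPU 1 (the id "1" here is just an example — the variable must be set before torch is imported, because it is read when CUDA initializes):

```python
import os

# Must be set BEFORE importing torch / any CUDA initialization
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# The process now sees only physical GPU 1, renumbered as cuda:0
print(os.environ["CUDA_VISIBLE_DEVICES"])
```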
Do two different GPU models conflict?
- According to this link, if the machine is only used to run deep-learning code, it's fine as long as the framework build supports both cards.