Machine setup:
- Workstation: Dell Precision 5820
- GPUs: one 2080 Ti, one 3090
I just got the 3090; before that I only used the 2080 Ti. Both cards are used purely for running deep-learning code. I'll keep updating this post as I learn more — corrections and tips are welcome!
Upgrading CUDA
- The 3090 (Ampere) requires CUDA 11 or later; I had been on CUDA 10.2 all along, so an upgrade was necessary.
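After upgrading, a quick sanity check like the one below can confirm that the installed PyTorch build actually supports the new card (a sketch, assuming a reasonably recent PyTorch; it also runs on CPU-only machines):

```python
import torch

# CUDA version this PyTorch build was compiled against (needs to be 11.x+ for a 3090)
print("compiled CUDA version:", torch.version.cuda)

# Compute architectures baked into this build; a 3090 needs sm_86 (or compatible PTX)
print("supported archs:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        # Name and compute capability of each visible GPU
        print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
```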
Warning: "imbalance between your GPUs."
- The following warning appears:
There is an imbalance between your GPUs. You may want to exclude GPU 0 which has less than 75% of the memory or cores of GPU 1. You can do so by setting the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES environment variable.
- You can refer to this link ("pytorch, setting up multiple GPUs") and use the line below. In my case, though, the warning went away after I switched to a different script...
net = nn.DataParallel(model.cuda(), device_ids=[0, 1])
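A slightly fuller sketch of that DataParallel setup — the model and input here are stand-ins, and the two-GPU branch assumes device ids 0 and 1 are visible (it falls back gracefully on fewer devices):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for your real model
if torch.cuda.device_count() > 1:
    # Replicate the model on GPUs 0 and 1; each input batch is scattered across them
    model = nn.DataParallel(model.cuda(), device_ids=[0, 1])
elif torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(32, 128)
if torch.cuda.is_available():
    x = x.cuda()

out = model(x)  # outputs are gathered back onto the first device
print(out.shape)
```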
Setting batch_size
- With multiple GPUs, the batch_size passed to the DataLoader is the total batch, which nn.DataParallel splits across the cards. So when setting hyperparameters, just use your usual batch_size; don't multiply it by the number of GPUs (especially with mismatched cards — the smaller one sets the limit), or you may hit:
RuntimeError: CUDA out of memory. Tried to allocate 1.30 GiB (GPU 1; 10.76 GiB total capacity; 6.51 GiB already allocated; 1.18 GiB free; 8.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
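To make the splitting concrete: the batch_size given to the DataLoader is the total batch; nn.DataParallel then divides each such batch among the visible GPUs. This small demo (runs on CPU, so the split itself isn't exercised) shows where that number lives:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 4))
loader = DataLoader(dataset, batch_size=8)

(batch,) = next(iter(loader))
# 8 is the TOTAL batch size; with nn.DataParallel on 2 GPUs, each card gets 4 samples
print(batch.shape)
```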
Warning: Detected call of lr_scheduler.step() before optimizer.step()
- The following warning appears:
UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
- As this link suggests, just swap the order so that optimizer.step() is called before scheduler.step().
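A minimal sketch of the corrected call order (the parameter and loss here are toys, just to show the ordering inside a training loop):

```python
import torch

param = torch.nn.Parameter(torch.zeros(3))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(2):
    optimizer.zero_grad()
    loss = (param - 1.0).pow(2).sum()
    loss.backward()
    optimizer.step()   # update the weights first...
    scheduler.step()   # ...then advance the learning-rate schedule
    print(epoch, optimizer.param_groups[0]["lr"])
```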
GPU temperature
- 65-75 °C is a good range; you can check temperatures with the nvidia-smi command.
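The same reading can be pulled programmatically via nvidia-smi's query flags, which is handy for logging during long runs (a sketch; it returns an empty list on machines without the tool):

```python
import shutil
import subprocess

def gpu_temperatures():
    """Return a list of GPU temperatures in deg C, or [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One temperature per line, one line per GPU
    return [int(line) for line in out.split() if line]

print(gpu_temperatures())
```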
Pinning a specific GPU
- Following this link, put the lines below at the top of the file:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = 'gpu_id'
where gpu_id is the GPU's id, which you can look up by running nvidia-smi in a terminal.
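For example, to expose only physical GPU 1 (the id "1" here is just an example — the variable must be set before torch is imported, because it is read when CUDA initializes):

```python
import os

# Must be set BEFORE importing torch / any CUDA initialization
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# The process now sees only physical GPU 1, renumbered as cuda:0
print(os.environ["CUDA_VISIBLE_DEVICES"])
```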
Do two different GPU models conflict?
- According to this link, if the machine is only used to run deep-learning code, it's fine as long as the framework build supports both cards.