今天利用python和pytorch编写图像分类训练程序,好不容易噼里啪啦敲完键盘,运行之。。。。。,结果突然报错(RuntimeError cuda out of memory),使笔者大失所望,具体信息如下:
/usr/bin/python3.5 /home/xxx/train.py
Step 1: prepare train/test dataset
There are 121 classes
Step 1 has been completed ---------7.801877
Step 2: Begin to train the model
num_ftrs=2048
num_classes=121
Epoch [0/29] ----------
Traceback (most recent call last):
File "/home/xxx/train.py", line 121, in <module>
outputs=model(images)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 204, in forward
x = self.layer4(x)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 99, in forward
out = self.bn1(out)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 81, in forward
exponential_average_factor, self.eps)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1670, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 154.00 MiB (GPU 0; 23.65 GiB total capacity; 22.54 GiB already allocated; 18.00 MiB free; 257.96 MiB cached)
Process finished with exit code 1
其中采用的网络模型是torchvision自带的resnext101_32x8d模型,batch_size=100。其他代码不变,直接修改batch_size=50。并在命令行中启用 watch -n 0.1 nvidia-smi开启监控窗口,可以看到如下界面:
从图中可以看出,虽然有四块GPU卡,但是只用了其中一块,显存使用率已经过半。应该是batch_size=100的时候显存溢出了。
简单的通过减少batch_size的数值可以解决这个显存溢出的问题。但是这不是最完美的解决之道,而且四块卡没有得到很好的利用。后续将介绍 多卡训练模型的相关问题,敬请期待。
-------------------- 正文到此结束------------------------
推荐一个公众号:健哥聊量化,会持续推出股票相关基础知识,以及python实现的一些基本的分析代码。欢迎大家关注,二维码如下:
相关文章列表如下: