Running an image classification training program under the PyTorch framework hits "CUDA out of memory": exploring a fix


Today I wrote an image classification training program with Python and PyTorch. After finally hammering out the last of the code and running it, it abruptly failed with RuntimeError: CUDA out of memory, much to my disappointment. The details are as follows:

/usr/bin/python3.5 /home/xxx/train.py
Step 1: prepare train/test dataset
There are 121 classes
Step 1 has been completed ---------7.801877
Step 2: Begin to train the model
num_ftrs=2048
num_classes=121
Epoch [0/29] ----------
Traceback (most recent call last):
  File "/home/xxx/train.py", line 121, in <module>
    outputs=model(images)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 204, in forward
    x = self.layer4(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 99, in forward
    out = self.bn1(out)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 154.00 MiB (GPU 0; 23.65 GiB total capacity; 22.54 GiB already allocated; 18.00 MiB free; 257.96 MiB cached)

Process finished with exit code 1

The network used here is the resnext101_32x8d model that ships with torchvision, with batch_size=100. Leaving everything else unchanged, I simply changed batch_size to 50, and ran watch -n 0.1 nvidia-smi at the command line to keep a live monitoring window open.

The nvidia-smi display showed that, although the machine has four GPU cards, only one of them was actually in use, and its memory usage was already past the halfway mark. So with batch_size=100 the GPU memory simply overflowed.
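The post does not reproduce the training script itself, so below is only a minimal sketch of the relevant pieces, assuming a hypothetical dataset path and standard torchvision transforms; the one change that matters compared with the failing run is dropping the DataLoader batch_size from 100 to 50.

import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# resnext101_32x8d from torchvision, with the classifier resized to the 121 classes seen in the log
model = torchvision.models.resnext101_32x8d(pretrained=True)
num_ftrs = model.fc.in_features            # 2048, matching "num_ftrs=2048" above
model.fc = torch.nn.Linear(num_ftrs, 121)
model = model.to(device)

# The only change needed to avoid the OOM: batch_size 100 -> 50
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("/path/to/train", transform=transform)   # hypothetical path
train_loader = DataLoader(train_set, batch_size=50, shuffle=True, num_workers=4)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for images, labels in train_loader:        # one epoch, for illustration
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    outputs = model(images)                # the call that ran out of memory at batch_size=100
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

Halving the batch roughly halves the activation memory held during the forward and backward passes (parameter and optimizer memory stay the same), which is why the run then fits on a single 24 GiB card.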

Simply lowering the batch_size value does resolve this out-of-memory error, but it is not the ideal solution, and the four cards are still not being put to good use. A follow-up post will cover multi-GPU training in detail (a minimal preview is sketched below); stay tuned.
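As a quick preview only, one common way in PyTorch to put the idle cards to work is torch.nn.DataParallel, which replicates the model on every visible GPU and splits each input batch across them, so even batch_size=100 would place roughly 25 images per card instead of 100 on one. A minimal sketch, reusing the model built above:

import torch

if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = torch.nn.DataParallel(model)   # defaults to all visible devices
model = model.to("cuda:0")                 # outputs are gathered back on GPU 0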

-------------------- End of main text ------------------------

