RuntimeError: CUDA out of memory does not always mean the GPU is too small; here the memory blew up because the code stored tensors in a list

失眠的树亚

Published 2024-03-15 18:11:32

Category: Problem Log. Tags: deep learning, python, artificial intelligence

Article link: https://blog.csdn.net/weixin_44021274/article/details/136746729

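This post describes a CUDA out-of-memory error hit while training a model with PyTorch. Commenting out the line that stored loss values in a list, and storing the extracted value instead of the tensor, resolved the error. The author suspects the excessive memory use came from how the code was written.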

1. The code being run:

    def train(self):
        self.model.train()
        losses_t = []  # per-step training losses
        train_iter = self.data_iter['train']
        for step, (triplets, labels) in enumerate(train_iter):
            if self.p.gpu >= 0:
                # move the batch to the GPU
                triplets, labels = triplets.to("cuda"), labels.to("cuda")
            subj, rel = triplets[:, 0], triplets[:, 1]
            pred = self.model(self.g, subj, rel, 'tail')
            loss_t = self.model.calc_loss(pred, labels)
            print(loss_t)
            losses_t.append(loss_t)  # appends the loss tensor itself

This is the code I ran. Running it produces the following output and error:

tensor(6.8747, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9403, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.7224, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.7367, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9669, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9063, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
2024-03-15 17:42:50 [ERROR] [ -4280403,Out of memory ] cuMemAlloc(): fail to allocate 32768 KB memory (out of memory)
2024-03-15 17:42:50 [ERROR] [ -4280403,Out of memory ] cuMemAlloc(): fail to allocate 32768 KB memory (out of memory)
Traceback (most recent call last):
  File "run.py", line 518, in <module>
    runner.fit()
  File "run.py", line 92, in fit
    train_loss = self.train()
  File "run.py", line 174, in train
    pred = self.model(self.g, subj, rel, 'tail') 
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/GCN4KGC-main/RGCN+CompGCN+LTE/model/lte_models.py", line 140, in forward
    x_h, x_t, r = self.exop(x, r, self.x_ops, self.r_ops)  # vector representations after the linear transformation
  File "/root/GCN4KGC-main/RGCN+CompGCN+LTE/model/lte_models.py", line 65, in exop
    x_head = x_tail = self.h_ops_dict[x_op](x_head)
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/functional.py", line 1076, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.68 GiB total capacity; 6.23 GiB already allocated; 0 bytes free; 6.55 GiB reserved in total by PyTorch)

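At first I thought this was a hardware configuration problem on the server platform.
But this code had previously run and reproduced the results, so the problem was more likely in the code itself.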
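
The numbers in the final error line support that suspicion. As a rough sketch (the helper `parse_oom` and its regular expression are hypothetical, written only for the message format shown in this post, and are not part of PyTorch), the sizes can be pulled out and compared:

```python
import re

# Unit multipliers, expressed in MiB.
_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Hypothetical helper: extract sizes from a PyTorch OOM message.

    Only handles the message format shown in this post.
    """
    pattern = (
        r"Tried to allocate ([\d.]+) (MiB|GiB).*?"
        r"([\d.]+) (MiB|GiB) total capacity; "
        r"([\d.]+) (MiB|GiB) already allocated"
    )
    m = re.search(pattern, message)
    if m is None:
        return None
    req, req_u, total, total_u, used, used_u = m.groups()
    return {
        "requested_mib": float(req) * _TO_MIB[req_u],
        "total_mib": float(total) * _TO_MIB[total_u],
        "allocated_mib": float(used) * _TO_MIB[used_u],
    }


msg = ("RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB "
       "(GPU 0; 7.68 GiB total capacity; 6.23 GiB already allocated; "
       "0 bytes free; 6.55 GiB reserved in total by PyTorch)")
info = parse_oom(msg)
used_fraction = info["allocated_mib"] / info["total_mib"]
print(info, f"{used_fraction:.0%} of the card already allocated")
```

The failed request was only 32 MiB while most of the card was already allocated, which fits memory piling up across iterations better than a single oversized allocation, although it does not prove it.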

2. How the error was resolved:

I commented out losses_t.append(loss_t) in the code,
and the code then ran successfully.

# This is the printed loss_t
tensor(6.9063, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)

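The cause turned out to be that loss_t is a tensor, and appending loss_t directly to the list blows up GPU memory. As the printed output shows, loss_t still carries a grad_fn, so every tensor kept in the list holds a reference to the autograd graph that produced it, and the GPU memory tied to earlier iterations may never be released.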
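
A minimal CPU-only sketch of the difference, assuming a toy `torch.nn.Linear` model and random data in place of the real model and batches (all names and shapes here are illustrative, not taken from the original project): the loss tensor references its graph through `grad_fn`, while `.item()` yields a plain Python float and `.detach()` yields a tensor with no graph.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the real model and batch (illustrative only).
model = torch.nn.Linear(4, 1)
inputs = torch.randn(8, 4)
targets = torch.rand(8, 1)

pred = torch.sigmoid(model(inputs))
loss = torch.nn.functional.binary_cross_entropy(pred, targets)

# The loss tensor references the autograd graph through grad_fn,
# so keeping the tensor alive keeps that graph reachable too.
has_graph = loss.grad_fn is not None

# .item() copies the scalar out as a Python float with no graph attached.
value = loss.item()

# .detach() returns a tensor that shares the data but has no grad_fn.
detached = loss.detach()

print(has_graph, type(value).__name__, detached.grad_fn)
```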

To keep the value of loss_t, use loss_t.item() to take the number out of the tensor as a plain Python float, and append that instead.
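
Put into a loop, the fix might look like the sketch below. The toy model, random batches, SGD optimizer, and the `backward()`/`step()` calls are assumptions added for illustration, since the original snippet does not show that part of the training step.

```python
import torch

torch.manual_seed(0)

# Illustrative stand-ins; the real project has its own model and data iterator.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()
batches = [(torch.randn(8, 4), torch.rand(8, 1)) for _ in range(5)]

losses = []
model.train()
for inputs, targets in batches:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # Store a plain float, not the tensor, so no graph is kept per step.
    losses.append(loss.item())

mean_loss = sum(losses) / len(losses)
print(f"mean loss over {len(losses)} steps: {mean_loss:.4f}")
```

Because the list holds only floats, its memory cost is a few bytes per step and nothing on the GPU stays referenced through it.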
