RuntimeError: CUDA out of memory was not a real shortage of GPU memory: storing tensors in a list in the code exhausted it

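This post describes a CUDA out-of-memory error encountered while training a model with PyTorch. Commenting out the line that stored the loss tensor in a list, and storing the tensor's scalar value instead, resolved the error. The author suspected the excessive memory use came from how the code was written.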

1. The code that was run:

    def train(self):
        self.model.train()
        losses_t = []  # collects one entry per training step
        train_iter = self.data_iter['train']
        for step, (triplets, labels) in enumerate(train_iter):
            if self.p.gpu >= 0:
                triplets, labels = triplets.to("cuda"), labels.to("cuda")
            subj, rel = triplets[:, 0], triplets[:, 1]
            pred = self.model(self.g, subj, rel, 'tail')
            loss_t = self.model.calc_loss(pred, labels)
            print(loss_t)
            losses_t.append(loss_t)  # appends the loss tensor itself, not its value

This is the code I ran. Running it produced the following error:

tensor(6.8747, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9403, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.7224, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.7367, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9669, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9063, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
2024-03-15 17:42:50 [ERROR] [ -4280403,Out of memory ] cuMemAlloc(): fail to allocate 32768 KB memory (out of memory)
2024-03-15 17:42:50 [ERROR] [ -4280403,Out of memory ] cuMemAlloc(): fail to allocate 32768 KB memory (out of memory)
Traceback (most recent call last):
  File "run.py", line 518, in <module>
    runner.fit()
  File "run.py", line 92, in fit
    train_loss = self.train()
  File "run.py", line 174, in train
    pred = self.model(self.g, subj, rel, 'tail') 
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/GCN4KGC-main/RGCN+CompGCN+LTE/model/lte_models.py", line 140, in forward
    x_h, x_t, r = self.exop(x, r, self.x_ops, self.r_ops)  # 线性变换后的向量表示
  File "/root/GCN4KGC-main/RGCN+CompGCN+LTE/model/lte_models.py", line 65, in exop
    x_head = x_tail = self.h_ops_dict[x_op](x_head)
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/functional.py", line 1076, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.68 GiB total capacity; 6.23 GiB already allocated; 0 bytes free; 6.55 GiB reserved in total by PyTorch)

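At first I assumed the server's hardware configuration was the problem,
but this code had previously run to completion on the same setup, so the problem had to be in the code.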

2. How the error was resolved:

I commented out losses_t.append(loss_t) in the code,
and the code then ran successfully.

# This is the printed loss_t
tensor(6.9063, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)

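The cause turned out to be that loss_t is a tensor that still carries its autograd graph (note the grad_fn in the printout). Appending it to a list keeps a reference to that graph, and to whatever GPU tensors the graph still holds, for every training step, so GPU memory keeps growing until an allocation fails.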
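
To make the point about grad_fn concrete, here is a minimal sketch on a toy computation. The tensors `weights` and `inputs` are hypothetical stand-ins and not the post's model; the only PyTorch calls used are `Tensor.detach()` and `Tensor.item()`.

```python
import torch

# Hypothetical stand-ins for the post's model parameters and batch.
weights = torch.randn(4, 3, requires_grad=True)
inputs = torch.randn(2, 4)
loss_t = (inputs @ weights).sigmoid().mean()

# The loss is attached to an autograd graph, as the grad_fn in the printout shows.
has_graph = loss_t.grad_fn is not None

# Both of these drop the link to the graph.
detached = loss_t.detach()  # still a tensor on the same device, but no grad_fn
value = loss_t.item()       # plain Python float

print(has_graph, detached.grad_fn, type(value).__name__)
# prints: True None float
```

A list of `value` entries holds only Python floats, while a list of `loss_t` entries holds one graph reference per step.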

To keep a record of the loss values, call loss_t.item() to extract the scalar from the tensor as a Python number and store that instead.
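
As a rough sketch of that fix inside a training loop: the model, loss function, optimizer, and random batches below are all assumptions standing in for the post's `self.model`, `calc_loss`, and data iterator, and the `backward()`/`step()` calls are not shown in the post's excerpt. Only the `losses_t.append(loss_t.item())` line reflects the change described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the post's model, loss, and data iterator.
model = nn.Linear(8, 4).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [
    (torch.randn(16, 8), torch.randint(0, 2, (16, 4)).float())
    for _ in range(5)
]

def train_one_epoch():
    model.train()
    losses_t = []
    for inputs, labels in batches:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss_t = criterion(model(inputs), labels)
        loss_t.backward()
        optimizer.step()
        # Store a Python float, not the tensor, so no graph is kept alive.
        losses_t.append(loss_t.item())
    return losses_t

losses = train_one_epoch()
mean_loss = sum(losses) / len(losses)
print(f"mean loss over {len(losses)} steps: {mean_loss:.4f}")
```

If the values are needed as tensors later, `loss_t.detach().cpu()` is an alternative that also drops the graph reference, at the cost of keeping a small tensor per step.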

Author: 失眠的树亚
