1. The code being run:
def train(self):
    self.model.train()
    losses_t = []
    train_iter = self.data_iter['train']
    for step, (triplets, labels) in enumerate(train_iter):
        if self.p.gpu >= 0:
            # move the batch to the GPU
            triplets, labels = triplets.to("cuda"), labels.to("cuda")
        # subject and relation columns of each triplet
        subj, rel = triplets[:, 0], triplets[:, 1]
        pred = self.model(self.g, subj, rel, 'tail')
        loss_t = self.model.calc_loss(pred, labels)
        print(loss_t)
        losses_t.append(loss_t)
That is the code I ran. It prints a few loss values and then fails with the following error:
tensor(6.8747, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9403, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.7224, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.7367, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9669, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
tensor(6.9063, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
2024-03-15 17:42:50 [ERROR] [ -4280403,Out of memory ] cuMemAlloc(): fail to allocate 32768 KB memory (out of memory)
2024-03-15 17:42:50 [ERROR] [ -4280403,Out of memory ] cuMemAlloc(): fail to allocate 32768 KB memory (out of memory)
Traceback (most recent call last):
File "run.py", line 518, in <module>
runner.fit()
File "run.py", line 92, in fit
train_loss = self.train()
File "run.py", line 174, in train
pred = self.model(self.g, subj, rel, 'tail')
File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/GCN4KGC-main/RGCN+CompGCN+LTE/model/lte_models.py", line 140, in forward
x_h, x_t, r = self.exop(x, r, self.x_ops, self.r_ops) # vector representations after the linear transformation
File "/root/GCN4KGC-main/RGCN+CompGCN+LTE/model/lte_models.py", line 65, in exop
x_head = x_tail = self.h_ops_dict[x_op](x_head)
File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/root/.local/conda/envs/new/lib/python3.7/site-packages/torch/nn/functional.py", line 1076, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.68 GiB total capacity; 6.23 GiB already allocated; 0 bytes free; 6.55 GiB reserved in total by PyTorch)
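At first I thought this was a hardware limitation of the server platform,
but the original code had run on it and reproduced the results, so the problem had to be in my code.
2. How the error was resolved:
I commented out losses_t.append(loss_t) in the code
and found that the code then ran successfully.
# This is the printed loss_t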
tensor(6.9063, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
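I then found the cause: loss_t is a tensor, and as the grad_fn in the printed output shows, it is still attached to its computation graph. Appending loss_t directly to the list keeps a reference to that graph for every step, so the GPU memory it holds is presumably never released and usage grows until the allocation fails.
If you want to store the value of loss_t, use loss_t.item() to take the value out of the tensor as a plain Python number.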
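
The fix can be sketched as below. This is a minimal, self-contained example rather than the code from run.py: the model (nn.Linear), the loss function, the optimizer, and the random data are all hypothetical stand-ins so that the snippet runs on its own, on CPU. The only point it is meant to show is the last line of the loop, where the loss is stored as a float instead of as a tensor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss, and optimizer.
model = nn.Linear(4, 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(5):
    # Random batch used only for illustration.
    x = torch.randn(8, 4)
    y = torch.randint(0, 2, (8, 1)).float()

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # loss is a tensor that carries a grad_fn; loss.item() copies the
    # scalar out as a Python float with no link to the graph.
    losses.append(loss.item())

mean_loss = sum(losses) / len(losses)
print(mean_loss)
```

If the losses are needed as tensors later, storing loss_t.detach() instead also drops the link to the graph, although the detached tensor itself still lives on the GPU.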