After finishing the training code and running it, I hit the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [544, 768]], which is output 0 of ViewBackward, is at version 1; expected version 0 instead.
Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Roughly, this means that during gradient computation, autograd detected that some variable it needs was modified by an inplace operation.
Following the hint, enable anomaly detection by calling torch.autograd.set_detect_anomaly(True) (note that it is a function call, not the assignment torch.autograd.set_detect_anomaly = True, which would silently overwrite the function and do nothing), then run again to get the more detailed output below:
Traceback (most recent call last):
File "E:\xxx\main\main.py", line 71, in <module>
main(args)
File "E:\xxx\main\main.py", line 57, in main
train(config)
File "E:\xxx\train\train.py", line 114, in train
total_loss.backward()
File "D:\Anaconda3\envs\nlp\lib\site-packages\torch\tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "D:\Anaconda3\envs\nlp\lib\site-packages\torch\autograd\__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [544, 768]], which is output 0 of ViewBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
We can see that the error occurs during the backward pass.
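For reference, here is a minimal sketch (with a toy tensor, not my actual model) of the two ways anomaly detection can be enabled:

```python
import torch

# Global switch: every backward() from here on records forward tracebacks,
# so the error message also points at the offending forward op.
torch.autograd.set_detect_anomaly(True)

# Scoped alternative: checks only inside this block.
with torch.autograd.detect_anomaly():
    w = torch.ones(2, requires_grad=True)
    loss = (w * 2).sum()
    loss.backward()

print(w.grad)  # tensor([2., 2.])
```

Anomaly detection slows training down noticeably, so it is best turned on only while debugging.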
So what exactly is an inplace operation?
According to these Q&A threads on the PyTorch forums:
- What is in-place operation?
- Encounter the RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.
An inplace operation modifies the content of a tensor directly, without making a copy ("An in-place operation is an operation that changes directly the content of a given Tensor without making a copy").
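The error can be reproduced in a few lines, assuming the op in question saves its output for the backward pass (torch.exp does), which is what the version counter in the error message is guarding:

```python
import torch

w = torch.ones(2, requires_grad=True)
y = torch.exp(w)        # exp saves its output y for the backward pass
y.add_(1)               # in-place: bumps y's version counter from 0 to 1

err = None
try:
    y.sum().backward()  # backward needs y at version 0 -> RuntimeError
except RuntimeError as e:
    err = e
print(err)              # ... has been modified by an inplace operation ...

# The fix: replace the in-place op with an out-of-place one.
w2 = torch.ones(2, requires_grad=True)
y2 = torch.exp(w2)
y2 = y2 + 1             # new tensor; the saved exp output stays at version 0
y2.sum().backward()
print(w2.grad)          # equals exp(w2), the derivative of exp
```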
- In PyTorch, inplace operations can come from methods such as .add_() or .scatter_(). The .add_() method modifies the tensor directly, so x.add_(y) can be rewritten as x = x + y. If a separate copy is needed, the second thread above suggests using the .clone() method.
- In Python, inplace operations can also come from operators such as += or *=, since for tensors these dispatch to the in-place methods. For example, x += y should be changed to x = x + y.
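That += really is in-place for tensors (it calls Tensor.__iadd__, i.e. add_) can be checked via the storage pointer:

```python
import torch

y = torch.zeros(3)
ptr = y.data_ptr()

y += 1                       # in-place: the same storage is modified
assert y.data_ptr() == ptr

y = y + 1                    # out-of-place: a brand-new tensor is allocated
assert y.data_ptr() != ptr

print(y)                     # tensor([2., 2., 2.])
```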