1. one of the variables needed for gradient computation has been modified by an inplace operation
This error can have several causes:
(1) functions whose names end in '_' (e.g. tensor.add_());
(2) compound assignments such as += and /=;
(3) activation functions constructed with inplace=True, e.g. torch.nn.ReLU(inplace=True).
For case (2), replacing a += b with a = a + b is enough; for case (3), set inplace to False. Case (1) is more troublesome: you need to find an out-of-place substitute or reimplement the function yourself.
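A minimal reproduction and fix for the += case, assuming a recent PyTorch version (the tensors and values here are illustrative):

```python
import torch

# Reproduce the error: relu's backward pass needs its saved output,
# and an in-place op overwrites it.
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.relu(x)
try:
    y += 1                      # in-place: bumps y's version counter
    y.sum().backward()          # autograd detects the modification here
    failed = False
except RuntimeError:
    failed = True

# Fix: replace `y += 1` with the out-of-place `y = y + 1`.
x2 = torch.tensor([1.0, 2.0], requires_grad=True)
y2 = torch.relu(x2)
z2 = y2 + 1                     # new tensor; relu's saved output is untouched
z2.sum().backward()
```

The out-of-place form allocates a new tensor instead of mutating the one autograd saved, so backward succeeds.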
2. An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
This error appears when using torch.multiprocessing.spawn. PyTorch's torch.multiprocessing is built on top of Python's native multiprocessing module. When multiprocessing starts a child process on another CPU core, it re-imports the main .py file. So in the following example:
# start process for each gpu
mp.spawn(main, nprocs=args.g, args=(args,))
When the file is re-imported to run on another CPU core, this line executes again, recursively spawning a large number of child processes and eventually triggering the error above. The call should therefore be guarded as follows:
if __name__ == '__main__':
    # start process for each gpu
    mp.spawn(main, nprocs=args.g, args=(args,))
With the guard in place, the child processes no longer run mp.spawn.
3. Note: to_tensor automatically converts an input image from the [0, 255] range to [0, 1].
Official documentation: https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html#torchvision.transforms.ToTensor
4. If the model contains dropout or batchnorm layers, be sure to call eval() at test time (including after restoring the model from a file), because these two modules behave differently in train and test mode. Without it, PyTorch keeps them in train mode by default, which degrades test results.
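A small illustration with Dropout (the one-layer model here is a placeholder for a real network):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Dropout(p=0.5))
x = torch.ones(1, 10)

model.train()            # default mode: dropout randomly zeroes inputs
out_train = model(x)     # and rescales the survivors by 1 / (1 - p)

model.eval()             # eval mode: dropout becomes the identity
out_eval = model(x)

print(out_eval)          # identical to x: dropout is disabled
```

The same train/eval distinction applies to BatchNorm, which switches from batch statistics to its running statistics in eval mode.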
5. NotImplementedError: Could not run 'aten::slow_conv3d_forward' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::slow_conv3d_forward' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].
In my case, the actual cause was that the data had not been moved onto the GPU with .to(device).
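A sketch of the fix, assuming a Conv3d model (layer sizes and shapes are illustrative); the snippet falls back to CPU so it also runs on machines without a GPU:

```python
import torch
import torch.nn as nn

# Fall back to CPU when no GPU is present.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hypothetical 3D conv model, moved to the target device.
model = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3).to(device)

x = torch.randn(1, 1, 8, 8, 8)  # batch is created on the CPU
x = x.to(device)                # move it to the same device as the model

out = model(x)
print(out.shape)                # torch.Size([1, 4, 6, 6, 6])
```

Forgetting the `x = x.to(device)` line hands CPU tensors to a CUDA model, which in my case surfaced as the backend error above rather than the more common device-mismatch message.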