**Why are the results still different even though the random seed is set? A few small details that may help**
1. Run on the same machine and environment, and make sure CUDA >= 10.2.
2. Make sure the torch and numpy versions are reasonably up to date (a quick check is sketched below).
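A minimal way to confirm both points is to print the versions the current environment actually uses (just a sketch; which releases count as "new enough" is up to you):

```python
import numpy as np
import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built against; should be >= 10.2
print(torch.backends.cudnn.version())  # cuDNN version
print(np.__version__)                  # NumPy version
```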
3. After all the imports, set the random seed by calling `set_seed(42)` (any fixed integer works):
```python
import os
import random

import numpy as np
import torch


def set_seed(seed):
    random.seed(seed)                          # seed Python's built-in random module
    np.random.seed(seed)                       # seed NumPy
    torch.manual_seed(seed)                    # seed the CPU RNG
    torch.cuda.manual_seed(seed)               # seed the current GPU
    os.environ['PYTHONHASHSEED'] = str(seed)   # hash seed (read by subprocesses / at interpreter startup)
    torch.cuda.manual_seed_all(seed)           # seed all GPUs
    torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN algorithms
    torch.backends.cudnn.enabled = True
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8'  # needed for deterministic CuBLAS on CUDA >= 10.2
    torch.use_deterministic_algorithms(True)   # error out on non-deterministic ops
```
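A usage sketch (the surrounding script structure is my own illustration, not from the original): call it once, right after the imports and before the model, optimizer, or DataLoader are created, so nothing consumes randomness before the seeds are fixed.

```python
set_seed(42)

model = build_model().cuda()        # hypothetical helpers, shown only to indicate
train_loader = build_dataloader()   # where the set_seed(42) call should sit
```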
- `torch.use_deterministic_algorithms(True)` makes PyTorch raise an error whenever an operation without a deterministic implementation is hit, so it points you to exactly where reproducibility breaks.
  If this raises
  RuntimeError: Deterministic behavior was enabled with either torch.set_deterministic(True) or at::Context::setDeterministic(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
  the fix is exactly what the message asks for: add os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8' (see the sketch below).
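The error message asks for the variable to be set before the PyTorch application runs, so the safest place is the shell or the very top of the entry script, before importing torch (a minimal sketch):

```python
# Either export it in the shell:  export CUBLAS_WORKSPACE_CONFIG=:16:8
# or set it at the very top of the script, before importing torch:
import os

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8'  # ':4096:8' is the other accepted value

import torch
```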
- If os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8' is already set and you still get
  RuntimeError: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), ...
  see the reference link: link
- Another error you may run into:
max_pool3d_with_indices_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
  I went through two PyTorch issues that are quite close to this problem, but for some operations the randomness still cannot be removed at the moment:
  Reference link: link
  Reference link: link
  There are two workarounds here:
  (1) Turn the determinism check off just around the offending operation and switch it back on afterwards (a reusable helper for this pattern is sketched right after):

```python
torch.use_deterministic_algorithms(False)
loss = lossCE()  # the operation that triggered the error
torch.use_deterministic_algorithms(True)
```
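If this toggle is needed in several places, a small helper (the name allow_nondeterministic is my own, not from the original post) can wrap the same pattern in a context manager so the flag is restored even when the wrapped operation raises:

```python
from contextlib import contextmanager

import torch


@contextmanager
def allow_nondeterministic():
    # Temporarily relax the determinism check, then always restore it.
    torch.use_deterministic_algorithms(False)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(True)


# Usage:
# with allow_nondeterministic():
#     loss = lossCE()  # the op that has no deterministic implementation
```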
  (2) In my case the first line of the error already says max_pool3d_with_indices_backward_cuda does not have a deterministic implementation, i.e. the problem comes from max pooling. Going back through the code, the maxpool3d call is the one that triggers it; I removed that call so the model no longer max-pools there, which also works around the issue. I have not found other fixes yet, but more of these cases will likely be resolved inside PyTorch over time; see the second link above. One possible deterministic substitute for the pooling layer is sketched below.
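If dropping the pooling layer is not acceptable, one possible substitute (my own suggestion, not the author's actual change) is to downsample with a stride-2 convolution instead, which stays deterministic once torch.backends.cudnn.deterministic = True:

```python
import torch.nn as nn

# Before: the CUDA backward of 3-D max pooling has no deterministic kernel.
downsample = nn.MaxPool3d(kernel_size=2, stride=2)

# One deterministic alternative: a strided convolution does the downsampling
# (the channel count 32 is only an example and must match your model).
downsample = nn.Conv3d(in_channels=32, out_channels=32, kernel_size=2, stride=2)
```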
4. When the DataLoader uses worker processes (num_workers > 0), also set worker_init_fn:
```python
# worker_init_fn must be a callable that receives the worker id;
# here each worker gets a deterministic seed derived from the base seed.
train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True,
                          num_workers=6,
                          worker_init_fn=lambda worker_id: np.random.seed(seed + worker_id),
                          pin_memory=True)
```
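For completeness, the pattern from the PyTorch reproducibility notes can be sketched as below (seed_worker and the value 42 are placeholders of mine): each worker re-seeds NumPy and random from the loader's base seed, and a seeded torch.Generator makes the shuffle order repeatable as well.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def seed_worker(worker_id):
    # Derive each worker's seed from the DataLoader's base seed so that
    # NumPy/random used inside the workers are reproducible too.
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


g = torch.Generator()
g.manual_seed(42)  # fixes the shuffling order across runs

train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True,
                          num_workers=6, worker_init_fn=seed_worker,
                          generator=g, pin_memory=True)
```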