Random seeds and multiprocessing in Python
Python's built-in random module gets a different seed in each child process (CPython 3.7+ registers an after-fork hook via os.register_at_fork that reseeds it), whereas numpy.random simply inherits the parent process's RNG state on fork, so every child starts from the same seed. In PyTorch, the Dataset __getitem__() calls made by DataLoader worker processes each see a different torch seed, derived from the worker id (see the documentation of the worker_init_fn parameter). These three RNGs do not affect one another and must be handled independently. So when writing your own data-preparation code, if you use NumPy's randomization utilities, you must explicitly re-seed them in every subprocess, or else generate your random numbers with Python's own random module.
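A minimal sketch of the first claim (the function name foo_py is illustrative), assuming a fork-based Pool on Python 3.7+, where the random module reseeds itself in each child:

import random
from multiprocessing import Pool

def foo_py(_):
    # CPython >= 3.7 reseeds the random module after fork,
    # so each worker draws from its own stream.
    return random.uniform(0, 1)

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(foo_py, range(8)))  # values differ across workers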
Example for reference:
In the example below, random.uniform() and np.random.uniform() behave differently.
import numpy as np
import random
from multiprocessing import Pool

def Foo_np(seed=None):
    # np.random.seed(seed)
    return np.random.uniform(0, 1, 5)  # compare with random.uniform

# assumes the fork start method (the Linux default)
pool = Pool(processes=8)
print(np.array(pool.map(Foo_np, range(20))))
# [[ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.28917586 0.40997875 0.06308188 0.71512199 0.47386047]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]]
You can see that groups of up to 8 worker processes were forked with the same seed, giving identical random sequences (the first group is marked with arrows).
Calling np.random.seed() (with no argument) within a subprocess forces NumPy's global RNG instance to seed itself again from a fresh entropy source such as /dev/urandom or the wall clock, which will (probably) prevent you from seeing identical output from multiple subprocesses. Best practice, however, is to explicitly pass a different seed (or a numpy.random.RandomState instance) to each subprocess, e.g.:
def Foo_np(seed=None):
    local_state = np.random.RandomState(seed)
    print(local_state.uniform(0, 1, 5))

pool.map(Foo_np, range(20))
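On NumPy 1.17+, the same idea is usually written with the newer Generator API instead of the legacy RandomState. A minimal sketch under the same assumptions (foo_rng is an illustrative name):

import numpy as np
from multiprocessing import Pool

def foo_rng(seed):
    # Build an independent Generator per task from its own seed,
    # so no RNG state is shared with the parent or sibling processes.
    rng = np.random.default_rng(seed)
    return rng.uniform(0, 1, 5)

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        print(np.array(pool.map(foo_rng, range(20))))

np.random.SeedSequence(...).spawn(n) is another way to derive well-separated per-worker seeds from a single root seed.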
Fixes for PyTorch DataLoader workers loading random samples in multiple processes:
https://discuss.pytorch.org/t/does-getitem-of-dataloader-reset-random-seed/8097/7
Besides fixing it with Python's built-in random module, you can instead add the following at the top of your main script (Python 3 is required):
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    # 'spawn' gives each worker a fresh interpreter, so no RNG state is inherited via fork
    mp.set_start_method('spawn')
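Alternatively, re-seed NumPy per worker through the DataLoader's worker_init_fn, deriving the seed from torch.initial_seed(), which already differs per worker. A hedged sketch (NoisyDataset is a toy placeholder):

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class NoisyDataset(Dataset):
    # Toy dataset whose samples come from NumPy's global RNG.
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return np.random.uniform(0, 1, 3)

def worker_init_fn(worker_id):
    # torch.initial_seed() is base_seed + worker_id in each worker;
    # reuse it for NumPy (NumPy seeds must be < 2**32).
    np.random.seed(torch.initial_seed() % 2**32)

if __name__ == '__main__':
    loader = DataLoader(NoisyDataset(), num_workers=4,
                        worker_init_fn=worker_init_fn)
    for batch in loader:
        print(batch)  # no repeated rows across workers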