When doing distributed training with PyTorch, you may run into this error:
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Solution 1: add this setting to the environment variables:

export MKL_SERVICE_FORCE_INTEL=1

Solution 2: alternatively, switch MKL to the GNU threading layer:

export MKL_THREADING_LAYER=GNU
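Either variable can also be set from inside the entry script, as long as that happens before numpy or torch is first imported, since MKL consults it when the threading layer is first initialized. A minimal sketch of this approach:

import os

# Must run before the first numpy/torch import; by then MKL may
# already have chosen its threading layer.
# setdefault() respects a value already set in the shell.
os.environ.setdefault('MKL_THREADING_LAYER', 'GNU')

import numpy as np
import torch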
Problem analysis:
Grepping the conda manifests, libgomp is pulled in by libgcc-ng, which is in turn pulled in by pretty much everything. So the culprit is more likely whoever is setting MKL_THREADING_LAYER=INTEL - and as far as that goes, the behavior is weird.
The following script demonstrates it (note that torch is imported before numpy here):

import os

def print_layer(prefix):
    # Report whether MKL_THREADING_LAYER is set in this process.
    print(f'{prefix}: {os.environ.get("MKL_THREADING_LAYER")}')

if __name__ == '__main__':
    print_layer('Pre-import')
    from torch import multiprocessing as mp
    import numpy as np  # imported for its side effects on MKL
    print_layer('Post-import')

    mp.set_start_method('spawn')
    p = mp.Process(target=print_layer, args=('Child',))
    p.start()
    p.join()
If torch is imported before numpy, as in the script above, the child process gets a GNU threading layer (even though the parent doesn't have the variable defined):
Pre-import: None
Post-import: None
Child: GNU
But if the imports are swapped so numpy is imported before torch, the child process gets an INTEL threading layer:
Pre-import: None
Post-import: None
Child: INTEL
So I suspect numpy - or one of its imports - is messing with the env parameter of Popen, but after half an hour of searching I can't figure out how. Whatever the mechanism, the symptom fits the original error: the spawned child inherits MKL_THREADING_LAYER=INTEL, then re-imports torch (which loads libgomp.so.1), and those two are exactly the incompatible pair that solutions 1 and 2 above work around.
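One way to keep digging is to diff the process environment around the suspect import. This is a quick diagnostic sketch (my addition, not from the analysis above) that prints every variable that importing numpy adds, removes, or changes:

import os

before = dict(os.environ)  # snapshot prior to the import
import numpy as np         # the import under suspicion
after = dict(os.environ)

# Show anything the import added, removed, or modified.
for key in sorted(set(before) | set(after)):
    if before.get(key) != after.get(key):
        print(f'{key}: {before.get(key)!r} -> {after.get(key)!r}')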