- Problem description: Multi-GPU training of a model with Fluid fails with an error.
- Error output:
EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at [/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:157]
PaddlePaddle Call Stacks:
0 0x7f19fce44e96p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1 0x7f19fe6ea71ep paddle::platform::dynload::GetNCCLDsoHandle() + 1822
2 0x7f19fcf3d0f9p void std::__once_call_impl<std::_Bind_simple<decltype (ncclCommInitAll({parm#1}...)) paddle::platform::dynload::DynLoad__ncclCommInitAll::operator()<ncclComm**, int, int*>(ncclComm**, int, int*)::{lambda()#1} ()> >() + 9
3 0x7f1a6dfc2a80p pthread_once + 80
4 0x7f19fcf40651p paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace
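- Problem analysis:
The error message contains the line EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory ), which means the dynamic loader could not find libnccl.so. The likely cause is that NCCL is not installed correctly, for example it is missing or sits in a directory that is not on the library search path.
- Solution: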
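Since the error says libnccl.so cannot be found, try searching the whole filesystem for it:
find / -name "libnccl.so*"
Then add the directory that contains the file to the LD_LIBRARY_PATH environment variable. If the search returns nothing, NCCL is probably not installed and needs to be installed first.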
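The search-and-export steps above can be combined into one shell function. This is a minimal sketch, not part of PaddlePaddle or NCCL: the function name add_nccl_to_ld_path and its optional search-root argument are hypothetical, and it assumes bash with GNU find (for -quit).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with PaddlePaddle or NCCL).
# Searches a root directory (default: /) for libnccl.so* and prepends the
# directory of the first match to LD_LIBRARY_PATH for the current shell.
add_nccl_to_ld_path() {
    local search_root="${1:-/}"
    local lib_file
    # -print -quit stops at the first match (GNU find assumed);
    # 2>/dev/null hides "Permission denied" messages from unreadable dirs.
    lib_file="$(find "$search_root" -name 'libnccl.so*' -print -quit 2>/dev/null || true)"
    if [ -z "$lib_file" ]; then
        echo "libnccl.so not found under $search_root; NCCL may not be installed" >&2
        return 1
    fi
    local lib_dir
    lib_dir="$(dirname "$lib_file")"
    # Prepend the directory; keep any existing value after a colon.
    export LD_LIBRARY_PATH="${lib_dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    echo "Added $lib_dir to LD_LIBRARY_PATH"
}
```

Source the file that defines the function and call add_nccl_to_ld_path (optionally with a narrower root such as /usr to speed up the search) before starting training. The export only affects the current shell session; to make it permanent, add the export line to a startup file such as ~/.bashrc.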
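- Further background:
NCCL is short for NVIDIA Collective Communications Library. It is a library of multi-GPU collective communication primitives (such as all-gather, reduce, and broadcast), which NVIDIA has heavily optimized to achieve high communication speed over PCIe, NVLink, and InfiniBand.
Multi-GPU parallel training in deep learning typically uses NCCL for communication between GPUs.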