Multi-GPU training fails when training a model with Fluid

PaddlePaddle (飞桨)

Published 2019-03-08 11:25:34


Article link: https://blog.csdn.net/PaddlePaddle/article/details/88343934


Problem description: Training a model with Fluid fails with an error when multiple GPUs are used.
Error output:

EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory ) 
 Please specify its path correctly using following ways: 
 Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS. 
 For instance, issue command: export LD_LIBRARY_PATH=... 
 Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at [/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:157]
PaddlePaddle Call Stacks: 
0       0x7f19fce44e96p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1       0x7f19fe6ea71ep paddle::platform::dynload::GetNCCLDsoHandle() + 1822
2       0x7f19fcf3d0f9p void std::__once_call_impl<std::_Bind_simple<decltype (ncclCommInitAll({parm#1}...)) paddle::platform::dynload::DynLoad__ncclCommInitAll::operator()<ncclComm**, int, int*>(ncclComm**, int, int*)::{lambda()#1} ()> >() + 9
3       0x7f1a6dfc2a80p pthread_once + 80
4       0x7f19fcf40651p paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace

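Problem analysis:
The key line in the error output is EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory ). It means the dynamic loader could not find libnccl.so. The likely causes are that NCCL is not installed, or that it is installed in a directory the loader does not search.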
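One quick way to confirm this diagnosis is to check whether any directory currently listed in LD_LIBRARY_PATH contains the library. The sketch below is only an illustration: `lib_on_path` is a hypothetical helper name, not part of PaddlePaddle or NCCL, and it inspects only the path list you pass in. The loader also searches system default directories and the ldconfig cache, so a miss here does not by itself prove that the library is absent.

```shell
#!/bin/sh
# lib_on_path NAME PATHLIST
# Print the first file whose name starts with NAME in any directory of the
# colon-separated PATHLIST; return 1 if none is found.
# (Hypothetical helper for illustration only.)
lib_on_path() {
    lib_name=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        IFS=$old_ifs
        # Skip empty entries such as those produced by "a::b".
        [ -n "$dir" ] || continue
        # The trailing * also matches versioned names like libnccl.so.2.
        for candidate in "$dir/$lib_name"*; do
            if [ -e "$candidate" ]; then
                echo "$candidate"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    return 1
}

# Usage: report whether libnccl.so is reachable through LD_LIBRARY_PATH.
lib_on_path libnccl.so "${LD_LIBRARY_PATH:-}" \
    || echo "libnccl.so not found in LD_LIBRARY_PATH"
```

If the helper prints a path, the library is already visible through LD_LIBRARY_PATH and the cause probably lies elsewhere, for example a different environment for the training process.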
Solution:

Since the error says libnccl.so cannot be found, first search the whole filesystem for it:

find / -name "libnccl.so*"

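Then add the directory that contains the file (not the file path itself) to the LD_LIBRARY_PATH environment variable. If the search finds nothing, NCCL is not installed and must be installed first.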
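As a minimal sketch of that last step, the snippet below prepends a directory to LD_LIBRARY_PATH only if it is not already listed. The function name `prepend_ld_library_path` is hypothetical, and `/usr/local/nccl/lib` is a placeholder assumption; substitute the directory that `find` reported on your machine.

```shell
#!/bin/sh
# prepend_ld_library_path DIR
# Put DIR at the front of LD_LIBRARY_PATH unless it is already present,
# then export the variable so child processes (e.g. python) inherit it.
# (Hypothetical helper for illustration only.)
prepend_ld_library_path() {
    new_dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$new_dir:"*)
            ;;  # already listed, leave the variable unchanged
        *)
            # Add a ':' separator only when the variable was non-empty.
            LD_LIBRARY_PATH="$new_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Placeholder directory: replace with the location found on your system.
prepend_ld_library_path /usr/local/nccl/lib
echo "$LD_LIBRARY_PATH"
```

The change lasts only for the current shell session. To make it persistent, the same export can be placed in a shell startup file such as ~/.bashrc.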

Further notes:

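NCCL stands for NVIDIA Collective Communications Library. It implements collective communication operations across multiple GPUs (such as all-gather, reduce, and broadcast), and NVIDIA has optimized it heavily to achieve high communication throughput over PCIe, NVLink, and InfiniBand.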

In deep learning, multi-GPU parallel training typically relies on NCCL for communication between GPUs.
