TensorFlow error: No OpKernel was registered to support Op 'NcclAllReduce'

Overview

While training a model with TensorFlow, the following error was raised:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/graph_builder/utils.py:160) with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'

	 [[AllReduceGrads/NcclAllReduce]]

Errors may have originated from an input operation.
Input Source operations connected to node AllReduceGrads/NcclAllReduce:
 tower0/gradients/AddN_373 (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/train/tower.py:276)
terminate called without an active exception
terminate called recursively
terminate called recursively
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
Aborted (core dumped)

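This error occurred only when training in parallel on multiple GPUs. Training on a single GPU raised no error, and the specified GPU showed 135M of GPU memory in use.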

Environment

  • OS: Ubuntu 16.04
  • CUDA version: 10.1
  • cuDNN version: 8.0.2
  • tensorflow-gpu: 1.14.0

Cause analysis and solution

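The error comes down to an environment configuration problem. Scrolling back through the output printed before the error above turns up the following messages: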

2020-08-14 13:58:07.324004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324205: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324311: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324508: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324614: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-14 13:58:07.324666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
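When the log is long, the dlopen failures can be pulled out with a short script. The following is a minimal sketch, assuming the message format is exactly the one shown in the log above; the function name `missing_libraries` and the `sample` string are made up for illustration.

```python
import re

# Matches the "Could not dlopen library '<name>'" messages shown in the log above.
_DLOPEN_FAILURE = re.compile(r"Could not dlopen library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names that failed to load, in order, without duplicates."""
    seen = []
    for name in _DLOPEN_FAILURE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Two messages copied from the log above (timestamps and paths shortened).
sample = (
    "Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: "
    "cannot open shared object file: No such file or directory\n"
    "Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: "
    "cannot open shared object file: No such file or directory\n"
)

if __name__ == "__main__":
    for lib in missing_libraries(sample):
        print(lib)
```

The version suffix of each reported name (for example `.so.10.0`) is what tells you which CUDA release this TensorFlow build was compiled against.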

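The log shows that the libcu*.so.10.0 libraries (and libcudnn.so.7) cannot be found, because LD_LIBRARY_PATH only contains /usr/local/cuda-10.1/lib64. TensorFlow therefore skips registering GPU devices, and since the NcclAllReduce op only has a GPU kernel, no usable kernel is left for it. The root cause is a CUDA version mismatch: I had CUDA 10.1 installed, while TensorFlow 1.14 requires CUDA 10.0. The fix is to change either the CUDA version or the TensorFlow version. The official TensorFlow documentation lists which TensorFlow and CUDA versions go together: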

Official documentation: https://www.tensorflow.org/install/source?hl=zh-cn

The version table there confirms that tensorflow_gpu-1.14.0 is paired with CUDA 10.0. In the end I switched the CUDA version, which resolved the problem.
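After changing the CUDA installation, it can help to confirm that the libraries named in the log are now loadable before starting a training run. The sketch below is only an approximation: it uses Python's `ctypes.CDLL`, which relies on the system's dynamic loader and so honours LD_LIBRARY_PATH, but it is not TensorFlow's own loading code. The list `REQUIRED_LIBS` is copied from the dlopen failures in the log, and the helper name `find_missing_libs` is hypothetical.

```python
import ctypes

# Library names taken from the "Could not dlopen library" messages in the log.
REQUIRED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def find_missing_libs(names):
    """Return the subset of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_libs(REQUIRED_LIBS)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All listed libraries can be loaded.")
```

Run it in the same shell and environment used for training, so that it sees the same LD_LIBRARY_PATH.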

 

 
