linux - Why do jobs in Slurm freeze indefinitely when running TensorFlow scripts?

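I am running into this error when using the Slurm (http://slurm.schedmd.com/) workload manager. When I run some TensorFlow Python scripts, they sometimes produce the error attached below. It looks like TensorFlow cannot find the installed CUDA libraries, yet the scripts I am running do not need a GPU. So I find it very confusing that CUDA is a problem at all. If I don't need it, why would the CUDA installation matter?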

The only useful information I get from the slurm-job_id file is the following:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally

I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib

I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally

E tensorflow/stream_executor/cuda/cuda_driver.cc:491] Failed call to cuInit: CUDA_ERROR_NO_DEVICE

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015

GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)

"""

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0

I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.

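I always thought TensorFlow did not require a GPU. So I assume the last message, which says no GPU devices are available, is not what causes the failure (please correct me if I'm wrong).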

I don't understand why I would need the CUDA libraries. I am not trying to run my jobs on a GPU. If my jobs are CPU jobs, why do I need the CUDA libraries?
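The log above reports that libcudnn.so could not be opened and prints the LD_LIBRARY_PATH that was in effect. A minimal sketch for checking which of the libraries named in the log are actually present in those directories might look like the following. The helper name `find_shared_library` is hypothetical, and the check is only approximate: the dynamic loader also consults system default directories and its cache, so a library that is missing here (for example libcuda.so, which usually ships with the driver) may still load.

```python
import os


def find_shared_library(prefix, search_path):
    """Return files whose name starts with `prefix` in the directories of `search_path`.

    `search_path` uses the same colon-separated format as LD_LIBRARY_PATH.
    Hypothetical helper for illustration; it does not replicate the full
    lookup rules of the dynamic loader.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries and directories that do not exist on this node.
        if not directory or not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            if entry.startswith(prefix):
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    # Library names taken from the log messages in the question.
    for lib in ("libcudnn.so", "libcublas.so", "libcufft.so", "libcurand.so"):
        found = find_shared_library(lib, path)
        print(lib, found if found else "not found on LD_LIBRARY_PATH")
```

Running this both inside a Slurm job and in a direct login session on the same node would show whether the two environments see the same library directories.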

I tried logging into the node directly and starting TensorFlow there, but I got no obvious errors:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally

I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib

I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally

whereas I expected to see the errors:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally

I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib

I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally

E tensorflow/stream_executor/cuda/cuda_driver.cc:491] Failed call to cuInit: CUDA_ERROR_NO_DEVICE

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015

GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)

"""

I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0

I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
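Since the same node gives different output in a direct login session and under Slurm, one way to narrow this down is to record the GPU-related environment in both contexts and compare. The sketch below is only an illustration: the function names are hypothetical, and the choice of variables to watch is an assumption. One unconfirmed possibility is that the scheduler sets or clears CUDA_VISIBLE_DEVICES for jobs that did not request a GPU, which would hide the devices from CUDA inside the job only.

```python
import os
import socket

# Variables assumed to be relevant; extend the tuple as needed.
WATCHED = ("CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH", "SLURM_JOB_ID")


def environment_snapshot(environ=None):
    """Collect the hostname and the watched variables into a dict."""
    environ = os.environ if environ is None else environ
    snapshot = {"hostname": socket.gethostname()}
    for name in WATCHED:
        # Distinguish "unset" from "set to an empty string".
        snapshot[name] = environ.get(name, "<unset>")
    return snapshot


def diff_snapshots(first, second):
    """Return {key: (value_in_first, value_in_second)} for keys that differ."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys if first.get(k) != second.get(k)}


if __name__ == "__main__":
    # Run once in a login session and once inside a Slurm job, then compare.
    for key, value in sorted(environment_snapshot().items()):
        print(f"{key}={value}")
```

If the two snapshots differ only in CUDA_VISIBLE_DEVICES, that would point to the scheduler rather than to the CUDA installation itself.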

I have also filed an official issue in the TensorFlow Git repository.
