[Windows 10 and Linux] Notes on pitfalls when setting up tensorflow-gpu (python3.6 + cuda10.0 + keras2.2.4 + tensorflow-gpu1.13.1)

Author: miracleo_

Last modified: 2022-05-05 23:24:03

First published: 2021-03-03 16:21:37

Article link: https://blog.csdn.net/miracleoa/article/details/114290025

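Contents

1. Installation issues
2. Recording and plotting with TensorBoard from Keras code
3. cuDNN environment variable configuration
4. TensorBoard errors
4. Linux environment configuration errors
5. Error: AttributeError: 'str' object has no attribute 'decode'
6. Package index mirrors
7. Multi-GPU error
8. Installing pytorch==1.2.0
9. Installing opencv==3.4.3
10. h5py
References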

Note: file names are case-sensitive on Linux but not on Windows. If the data files end in .JPG, code that looks for .jpg will not find them on Linux.
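One way to sidestep the case issue is to compare file extensions in lower case when collecting the data files. The sketch below is only an illustration: the function name `list_images` and the default extension tuple are assumptions, not part of the original training script.

```python
import os


def list_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return image paths in `folder`, matching extensions case-insensitively."""
    matches = []
    for name in sorted(os.listdir(folder)):
        # Lower-case only the extension for the comparison; the real file
        # name is kept unchanged so the path still works on Linux.
        _, ext = os.path.splitext(name)
        if ext.lower() in extensions:
            matches.append(os.path.join(folder, name))
    return matches
```

With this, photo.JPG and photo.jpg are both picked up on either operating system.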

1. Installation issues

[anaconda] Creating, listing, and removing conda virtual environments (anaconda command reference)
CUDA 10.0 download page
cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.0

Using tensorflow-gpu from Keras

import os

# Expose only GPU 0 to TensorFlow; use "0,1" to expose GPUs 0 and 1
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

# Configure the session so that each visible GPU uses at most 90% of its memory
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
session = tf.Session(config=config)
# Make Keras use this session
KTF.set_session(session)

After setting up cuda10.0 + keras2.1.5 + tensorflow-gpu1.14, another error appeared:

2021-03-03 09:40:02.989207: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2021-03-03 09:40:02.989419: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at constant_op.cc:172 : Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node decoder_conv0_depthwise_BN/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_3631]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node decoder_conv0_depthwise_BN/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/pycharm/Semantic-Segmentation-master/deeplab_Mobile/train.py", line 145, in <module>
    callbacks=[checkpoint_period, reduce_lr, early_stopping])
  File "D:\ananconda\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\ananconda\lib\site-packages\keras\engine\training.py", line 2224, in fit_generator
    class_weight=class_weight)
  File "D:\ananconda\lib\site-packages\keras\engine\training.py", line 1883, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 2478, in __call__
    **self.session_kwargs)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node decoder_conv0_depthwise_BN/FusedBatchNorm (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:1802) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_3631]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node decoder_conv0_depthwise_BN/FusedBatchNorm (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:1802) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node decoder_conv0_depthwise_BN/FusedBatchNorm:
 decoder_conv0_depthwise/depthwise (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:3522)	
 decoder_conv0_depthwise_BN/beta/read (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:395)

Input Source operations connected to node decoder_conv0_depthwise_BN/FusedBatchNorm:
 decoder_conv0_depthwise/depthwise (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:3522)	
 decoder_conv0_depthwise_BN/beta/read (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:395)

Original stack trace for 'decoder_conv0_depthwise_BN/FusedBatchNorm':
  File "/pycharm/Semantic-Segmentation-master/deeplab_Mobile/train.py", line 87, in <module>
    model = Deeplabv3(classes=NCLASSES,input_shape=(HEIGHT,WIDTH,3))
  File "\pycharm\Semantic-Segmentation-master\deeplab_Mobile\nets\deeplab.py", line 124, in Deeplabv3
    depth_activation=True, epsilon=1e-5)
  File "\pycharm\Semantic-Segmentation-master\deeplab_Mobile\nets\deeplab.py", line 47, in SepConv_BN
    x = BatchNormalization(name=prefix + '_depthwise_BN', epsilon=epsilon)(x)
  File "\ananconda\lib\site-packages\keras\engine\topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "\ananconda\lib\site-packages\keras\layers\normalization.py", line 181, in call
    epsilon=self.epsilon)
  File "\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 1827, in normalize_batch_in_training
    epsilon=epsilon)
  File "\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 1802, in _fused_normalize_batch_in_training
    data_format=tf_data_format)
  File "\ananconda\lib\site-packages\tensorflow\python\ops\nn_impl.py", line 1329, in fused_batch_norm
    name=name)
  File "\ananconda\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4301, in _fused_batch_norm
    name=name)
  File "\ananconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "\ananconda\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "\ananconda\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "\ananconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

The error turned out to be caused by insufficient GPU memory. Reducing batch_size from 8 to 4 fixed it, and training ran normally.
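To get a feel for the numbers, the size of the tensor named in the OOM message can be estimated directly from its shape. This is a rough sketch that assumes "type float" means 4-byte float32; the helper name `tensor_bytes` is made up for illustration. It covers only this one tensor, while a real training step keeps many such activations alive at once.

```python
from functools import reduce


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element for float32)."""
    return reduce(lambda a, b: a * b, shape, 1) * bytes_per_element


# Shape copied from the OOM message: the first dimension is the batch size
size_batch8 = tensor_bytes((8, 304, 104, 104))
size_batch4 = tensor_bytes((4, 304, 104, 104))

print(size_batch8 / 2**20)  # size in MiB with batch_size=8
print(size_batch4 / 2**20)  # size in MiB with batch_size=4
```

The tensor is roughly 100 MiB at batch size 8 and half that at batch size 4, which is why a smaller batch lets the same model fit.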

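2. Recording and plotting with TensorBoard from Keras code

The approach is to add TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True) to the callbacks of model.fit_generator(), then run tensorboard --logdir=./logs --host=127.0.0.1 --port=6006 in a console to open the visualization page. Note that with histogram_freq=0 no histograms are computed, so write_grads=True has no effect here.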

import tensorflow as tf
from nets.deeplab import Deeplabv3
from keras.utils.data_utils import get_file
from keras.optimizers import Adam
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from PIL import Image
import time
import keras
from keras import backend as K
import numpy as np
import keras.backend.tensorflow_backend as KTF
import os

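    # Excerpt from train.py: model, lines, num_train, num_val, batch_size,
    # log_dir, checkpoint_period and reduce_lr are defined earlier in the script.
    # Start training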
    model.fit_generator(generate_arrays_from_file(lines[:num_train], batch_size),
            steps_per_epoch=max(1, num_train//batch_size),
            validation_data=generate_arrays_from_file(lines[num_train:], batch_size),
            validation_steps=max(1, num_val//batch_size),
            epochs=100,
            initial_epoch=0,
            # Early-stopping variant
            # callbacks=[checkpoint_period, reduce_lr, early_stopping,
            #            TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True)])
            callbacks = [checkpoint_period, reduce_lr,
                 TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True)])

    model.save_weights(log_dir+'last1.h5')

3. cuDNN environment variable configuration

ImportError: Could not find 'cudart64_100.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 10.0 from this URL: https://developer.nvidia.com/cuda-90-download-archive

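After configuring the CUDA/cuDNN environment variables, restart PyCharm or PowerShell so that they pick up the updated configuration; a process that was already running keeps the PATH it started with. This resolves the error.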
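A quick way to confirm that a freshly opened terminal really sees the DLL is to scan the directories in PATH for it. The helper below is a minimal sketch using only the standard library; `find_on_path` is a hypothetical name, not a TensorFlow or CUDA utility.

```python
import os


def find_on_path(filename, path_value=None):
    """Return the directories listed in PATH that contain `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        # Skip empty entries left by stray separators
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # An empty list means TensorFlow will not find the DLL either
    print(find_on_path("cudart64_100.dll"))
```

If the list is empty in a new terminal, the environment variable was not saved correctly; if it is empty only inside PyCharm, the IDE still needs a restart.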

4. TensorBoard errors

d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "d:\anaconda\anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "d:\anaconda\anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda\anaconda\Scripts\tensorboard.exe\__main__.py", line 4, in <module>
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\main.py", line 40, in <module>
    from tensorboard import default
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\default.py", line 39, in <module>
    from tensorboard.plugins.beholder import beholder_plugin_loader
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\__init__.py", line 22, in <module>
    from tensorboard.plugins.beholder.beholder import Beholder
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\beholder.py", line 199, in <module>
    class BeholderHook(tf.estimator.SessionRunHook):
  File "d:\anaconda\anaconda\lib\site-packages\tensorflow\python\util\deprecation_wrapper.py", line 106, in __getattr__
    attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow' has no attribute 'estimator'

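The fix is to open d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\beholder.py and delete the contents of the parentheses in the class definition class BeholderHook(tf.estimator.SessionRunHook):. After that the error no longer appears.
TF2 removed Session, so the session hook is not needed.
Be sure to use python3.6, otherwise tensorboard will report errors!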

4. Linux environment configuration errors

2021-03-13 17:53:43.280913: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281034: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281145: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281253: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281359: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281463: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281569: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281592: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2021-03-13 17:53:43.281658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-13 17:53:43.281681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 
2021-03-13 17:53:43.281695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N N 
2021-03-13 17:53:43.281706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   N N

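All of the files above are in /usr/local/cuda-10.0/lib64/ (the original post showed a listing of that directory here). The files were sitting in their usual place, yet tensorflow simply could not find them.
After this error the program still runs, but extremely slowly, because it only uses the CPU. What I did just before it started working was uninstall tensorflow-gpu==1.14.0 and install tensorflow-gpu==1.13.1. It is quite possible that this step changed some tensorflow configuration so that it could again read the library files under /usr/local/cuda-10.0/lib64/.
Here is everything I tried that might have helped. First I uninstalled miniconda3, installed anaconda, and reinstalled tensorflow-gpu==1.14.0 and keras==2.1.5; running the program showed that this made no difference.
Then I reopened ~/.bashrc and entered the lib64 path again: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64/. The relevant part of ~/.bashrc is as follows: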

export PATH="/home/sunqilin/anaconda3/bin:$PATH"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64/
export PATH=$PATH:/usr/local/cuda-10.0/bin
export CUDA_HOME=/usr/local/cuda-10.0

Reference: How to fix TensorFlow missing the libcusolver.so.10 and libcudnn.so.8 libraries
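Before starting a long training run, it can help to check whether the dynamic loader can open the libraries named in the log. The sketch below tries to load each one with `ctypes`, which goes through the same dlopen mechanism; the helper name `check_libraries` is an assumption for illustration. Because LD_LIBRARY_PATH is read when a process starts, run it from a new shell after editing ~/.bashrc.

```python
import ctypes

# Library names copied from the "Could not dlopen library" messages above
CUDA_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def check_libraries(names):
    """Try to load each library; map its name to None or the error text."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in check_libraries(CUDA_LIBS).items():
        print(name, "OK" if error is None else "FAILED: " + error)
```

Any line marked FAILED points at a library that tensorflow will also be unable to open, which leads to the CPU-only fallback described above.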

5. Error: AttributeError: 'str' object has no attribute 'decode'

https://github.com/fchollet/deep-learning-models/releases/download/v0.6/mobilenet_1_0_224_tf_no_top.h5
Traceback (most recent call last):
  File "/home/user-zhm/sql/Unet_attention/train.py", line 80, in <module>
    model.load_weights(weights_path, by_name=True, skip_mismatch=True)
  File "/home/user-zhm/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 2653, in load_weights
    reshape=reshape)
  File "/home/user-zhm/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 3407, in load_weights_from_hdf5_group_by_name
    original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'

Fix: uninstall the existing h5py module and install version 2.10.

pip install h5py==2.10 -i https://pypi.tuna.tsinghua.edu.cn/simple/

Reference
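The failing line calls .decode('utf8') on an HDF5 attribute. h5py 2.x returned such string attributes as bytes, while h5py 3.x returns them as str, which has no decode method. Pinning h5py as above is the author's fix; purely to illustrate the difference, a helper that tolerates both return types could look like this (`as_text` is a hypothetical name and is not part of Keras or h5py):

```python
def as_text(value, encoding="utf8"):
    """Return `value` as str whether it arrived as bytes or as str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Both forms of the same attribute value give the same result
print(as_text(b"2.1.5"))
print(as_text("2.1.5"))
```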

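6. Package index mirrors

The Douban mirror is recommended.

Aliyun http://mirrors.aliyun.com/pypi/simple/

University of Science and Technology of China (USTC), HTTPS https://pypi.mirrors.ustc.edu.cn/simple/

Douban http://pypi.douban.com/simple/

Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/

University of Science and Technology of China (USTC), HTTP http://pypi.mirrors.ustc.edu.cn/simple/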

7. Multi-GPU error

AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

The tensorflow version was probably the problem: downgrading tensorflow from 1.14.0 to 1.13.1 and changing keras to 2.2.4 resolved the issue.
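Since several of the errors in this post came down to version combinations, a small check that the active environment matches the pinned versions can save time. The sketch below is an assumption-laden illustration: the names `EXPECTED`, `installed_version` and `find_mismatches` are made up here, and the version table simply repeats the combination from this post.

```python
# Versions from this post; adjust to your own working combination
EXPECTED = {"tensorflow-gpu": "1.13.1", "keras": "2.2.4"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for Python 3.6 / 3.7
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that differs."""
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            report[package] = (wanted, found)
    return report


if __name__ == "__main__":
    print(find_mismatches(EXPECTED))
```

An empty dictionary means the environment matches; otherwise each entry shows the expected version next to what is actually installed (None if the package is missing).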

8. Installing pytorch==1.2.0

# CUDA 10.0
pip install torch==1.2.0 torchvision==0.4.0

9. Installing opencv==3.4.3

Installing opencv3 (opencv-python is a pip package, so it is installed with pip):

pip install opencv-python==3.4.0.12

10. h5py

Error:

    original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'

Uninstall the existing h5py module and install version 2.10.0:
pip install h5py==2.10.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

References
