[Windows 10 and Linux] Lessons learned setting up tensorflow-gpu (python3.6 + cuda10.0 + keras2.2.4 + tensorflow-gpu1.13.1)

Note: file names are case-sensitive on Linux but not on Windows, so if the data files end in .JPG, code that looks for .jpg will not find them on Linux.
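One way to avoid the .JPG/.jpg mismatch is to compare extensions in lower case when listing the dataset. The sketch below is only an illustration: the function name list_images and the default extension tuple are assumptions, not part of the original training code.

```python
import os


def list_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return image file names in `folder`, matching extensions case-insensitively."""
    wanted = tuple(ext.lower() for ext in extensions)
    # str.endswith accepts a tuple, so one call covers every extension.
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(wanted)
    )
```

With this, a folder containing both a.JPG and b.jpg yields both names on Windows and on Linux.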

1. Summary of installation problems

[anaconda] Creating, listing, and removing conda virtual environments (Anaconda command reference)
CUDA 10.0 download page
cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.0

Using tensorflow-gpu from Keras

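import os

import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

# Choose which GPUs are visible: "0" exposes only GPU 0 (use "0,1" for GPUs 0 and 1).
# This must be set before the first session is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Cap this process at 90% of the memory on each visible GPU.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
session = tf.Session(config=config)

# Make Keras use this session.
KTF.set_session(session)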

After setting up cuda10.0 + keras2.1.5 + tensorflow-gpu1.14, another error appeared:

2021-03-03 09:40:02.989207: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2021-03-03 09:40:02.989419: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at constant_op.cc:172 : Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node decoder_conv0_depthwise_BN/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_3631]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node decoder_conv0_depthwise_BN/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/pycharm/Semantic-Segmentation-master/deeplab_Mobile/train.py", line 145, in <module>
    callbacks=[checkpoint_period, reduce_lr, early_stopping])
  File "D:\ananconda\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\ananconda\lib\site-packages\keras\engine\training.py", line 2224, in fit_generator
    class_weight=class_weight)
  File "D:\ananconda\lib\site-packages\keras\engine\training.py", line 1883, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 2478, in __call__
    **self.session_kwargs)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node decoder_conv0_depthwise_BN/FusedBatchNorm (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:1802) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_3631]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node decoder_conv0_depthwise_BN/FusedBatchNorm (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:1802) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node decoder_conv0_depthwise_BN/FusedBatchNorm:
 decoder_conv0_depthwise/depthwise (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:3522)	
 decoder_conv0_depthwise_BN/beta/read (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:395)

Input Source operations connected to node decoder_conv0_depthwise_BN/FusedBatchNorm:
 decoder_conv0_depthwise/depthwise (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:3522)	
 decoder_conv0_depthwise_BN/beta/read (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:395)

Original stack trace for 'decoder_conv0_depthwise_BN/FusedBatchNorm':
  File "/pycharm/Semantic-Segmentation-master/deeplab_Mobile/train.py", line 87, in <module>
    model = Deeplabv3(classes=NCLASSES,input_shape=(HEIGHT,WIDTH,3))
  File "\pycharm\Semantic-Segmentation-master\deeplab_Mobile\nets\deeplab.py", line 124, in Deeplabv3
    depth_activation=True, epsilon=1e-5)
  File "\pycharm\Semantic-Segmentation-master\deeplab_Mobile\nets\deeplab.py", line 47, in SepConv_BN
    x = BatchNormalization(name=prefix + '_depthwise_BN', epsilon=epsilon)(x)
  File "\ananconda\lib\site-packages\keras\engine\topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "\ananconda\lib\site-packages\keras\layers\normalization.py", line 181, in call
    epsilon=self.epsilon)
  File "\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 1827, in normalize_batch_in_training
    epsilon=epsilon)
  File "\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 1802, in _fused_normalize_batch_in_training
    data_format=tf_data_format)
  File "\ananconda\lib\site-packages\tensorflow\python\ops\nn_impl.py", line 1329, in fused_batch_norm
    name=name)
  File "\ananconda\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4301, in _fused_batch_norm
    name=name)
  File "\ananconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "\ananconda\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "\ananconda\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "\ananconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
  • The error turned out to be caused by running out of GPU memory. Reducing batch_size from 8 to 4 fixed it, and training ran normally.
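To see why halving batch_size helps, it is worth working out how large the tensor named in the OOM message is. The sketch below assumes 4 bytes per element (the log says "type float", i.e. float32); tensor_bytes is a hypothetical helper written for this post, not a TensorFlow function.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of one dense tensor; float32 uses 4 bytes per element."""
    return reduce(mul, shape, 1) * bytes_per_element


# The first dimension of shape[8,304,104,104] is the batch size.
for batch_size in (8, 4):
    size = tensor_bytes((batch_size, 304, 104, 104))
    print(f"batch_size={batch_size}: {size / 2**20:.1f} MiB")
# batch_size=8: 100.3 MiB
# batch_size=4: 50.2 MiB
```

This is the size of a single activation tensor. The network keeps many such tensors alive during training, so the total saving from a smaller batch is considerably larger than the 50 MiB shown here.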

2. Logging to TensorBoard from Keras code

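  • Add TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True) to the callbacks passed to model.fit_generator(), then run tensorboard --logdir=./logs --host=127.0.0.1 --port=6006 in a console to get the visualization page. Note that write_grads only has an effect when histogram_freq is greater than 0.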
import tensorflow as tf
from nets.deeplab import Deeplabv3
from keras.utils.data_utils import get_file
from keras.optimizers import Adam
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from PIL import Image
import time
import keras
from keras import backend as K
import numpy as np
import keras.backend.tensorflow_backend as KTF
import os

    # Start training
    model.fit_generator(generate_arrays_from_file(lines[:num_train], batch_size),
            steps_per_epoch=max(1, num_train//batch_size),
            validation_data=generate_arrays_from_file(lines[num_train:], batch_size),
            validation_steps=max(1, num_val//batch_size),
            epochs=100,
            initial_epoch=0,
            # Variant with early stopping
            # callbacks=[checkpoint_period, reduce_lr, early_stopping,
            #            TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True)])
            callbacks = [checkpoint_period, reduce_lr,
                 TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True)])

    model.save_weights(log_dir+'last1.h5')
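If the TensorBoard page comes up empty, one thing worth checking is whether the directory passed to --logdir actually contains event files (their names start with events.out.tfevents.). A minimal sketch using only the standard library follows; find_event_files is a hypothetical helper, and "./logs" is simply the log_dir used above.

```python
import glob
import os


def find_event_files(log_dir):
    """Return TensorBoard event files found in `log_dir` or any subdirectory."""
    # With recursive=True, "**" matches zero or more directory levels.
    pattern = os.path.join(log_dir, "**", "events.out.tfevents.*")
    return sorted(glob.glob(pattern, recursive=True))


files = find_event_files("./logs")
print(f"{len(files)} event file(s) found under ./logs")
```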

3. cuDNN environment variable configuration

ImportError: Could not find 'cudart64_100.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 10.0 from this URL: https://developer.nvidia.com/cuda-90-download-archive
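  • After configuring the cuDNN environment variables, restart PyCharm or PowerShell so that the updated configuration is picked up; this resolves the error.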
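A running program only sees the PATH it was started with, which is why a restart is needed. To confirm that the new process can find the DLL, the PATH entries can be searched directly. This is a rough check under the assumption that the DLL is located through PATH, as the error message states; find_on_path is a hypothetical helper.

```python
import os


def find_on_path(filename, path_value=None):
    """Return every directory listed in PATH that contains `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        # Skip empty entries produced by stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


print(find_on_path("cudart64_100.dll") or "cudart64_100.dll is not on PATH")
```

Run it from the same terminal or IDE that launches training. An empty result there means that process has not picked up the new environment variables yet.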

4. TensorBoard error

d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "d:\anaconda\anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "d:\anaconda\anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda\anaconda\Scripts\tensorboard.exe\__main__.py", line 4, in <module>
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\main.py", line 40, in <module>
    from tensorboard import default
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\default.py", line 39, in <module>
    from tensorboard.plugins.beholder import beholder_plugin_loader
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\__init__.py", line 22, in <module>
    from tensorboard.plugins.beholder.beholder import Beholder
  File "d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\beholder.py", line 199, in <module>
    class BeholderHook(tf.estimator.SessionRunHook):
  File "d:\anaconda\anaconda\lib\site-packages\tensorflow\python\util\deprecation_wrapper.py", line 106, in __getattr__
    attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow' has no attribute 'estimator'
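  • The fix: in d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\beholder.py, delete the contents of the parentheses in class BeholderHook(tf.estimator.SessionRunHook): so that the class no longer inherits from tf.estimator.SessionRunHook. The error then goes away.
    TF2 removed Session, so the hook can be left out.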

  • Be sure to use python3.6; otherwise TensorBoard raises errors!

4. Linux environment configuration errors

2021-03-13 17:53:43.280913: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281034: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281145: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281253: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281359: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281463: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281569: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281592: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2021-03-13 17:53:43.281658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-13 17:53:43.281681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 
2021-03-13 17:53:43.281695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N N 
2021-03-13 17:53:43.281706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   N N 
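  • All of the files above are in /usr/local/cuda-10.0/lib64/. They were sitting in that directory the whole time, yet TensorFlow simply could not find them.
    [Figure: contents of /usr/local/cuda-10.0/lib64/]
  • With this error the program still runs, but extremely slowly, because it only uses the CPU. The last thing I did before it started working was uninstall tensorflow-gpu==1.14.0 and install tensorflow-gpu==1.13.1. It is quite possible that this step changed some TensorFlow configuration so that it could read the files under /usr/local/cuda-10.0/lib64/ again.
  • Here is everything else I tried that might have helped. First I uninstalled miniconda3, installed anaconda, and reinstalled tensorflow-gpu==1.14.0 and keras==2.1.5. Running the program showed that this made no difference.
  • Then I reopened ~/.bashrc and added the lib64 path again: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64/
    The full ~/.bashrc configuration is as follows: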
export PATH="/home/sunqilin/anaconda3/bin:$PATH"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64/
export PATH=$PATH:/usr/local/cuda-10.0/bin
export CUDA_HOME=/usr/local/cuda-10.0
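The log lines above come from TensorFlow trying to dlopen each library by name. The same lookup can be reproduced with ctypes before starting a long training run. This is a sketch, not TensorFlow's own check: check_loadable is a hypothetical helper, and the library names are copied from the log above. The dynamic loader reads LD_LIBRARY_PATH when the process starts, so run this in a new shell (or after source ~/.bashrc).

```python
import ctypes

# Library names taken from the "Could not dlopen library" messages.
CUDA_LIBS = [
    "libcudart.so.10.0", "libcublas.so.10.0", "libcufft.so.10.0",
    "libcurand.so.10.0", "libcusolver.so.10.0", "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def check_loadable(names):
    """Map each shared library name to True if the dynamic loader can open it."""
    result = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            result[name] = True
        except OSError:
            result[name] = False
    return result


for name, ok in check_loadable(CUDA_LIBS).items():
    print(("ok      " if ok else "MISSING ") + name)
```

If every line prints ok here but TensorFlow still skips the GPU, the environment is probably fine and the TensorFlow build itself (for example a version built against a different CUDA release) is the more likely culprit.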

5. Error: AttributeError: 'str' object has no attribute 'decode'

https://github.com/fchollet/deep-learning-models/releases/download/v0.6/mobilenet_1_0_224_tf_no_top.h5
Traceback (most recent call last):
  File "/home/user-zhm/sql/Unet_attention/train.py", line 80, in <module>
    model.load_weights(weights_path, by_name=True, skip_mismatch=True)
  File "/home/user-zhm/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 2653, in load_weights
    reshape=reshape)
  File "/home/user-zhm/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 3407, in load_weights_from_hdf5_group_by_name
    original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'
  • Fix: uninstall the existing h5py module and install version 2.10.
pip install h5py==2.10 -i https://pypi.tuna.tsinghua.edu.cn/simple/
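The traceback shows that f.attrs['keras_version'] is already a str when Keras calls .decode('utf8') on it. To my understanding, h5py 3.x returns such string attributes as str, whereas h5py 2.x returned bytes, which is why pinning 2.10 helps. The difference can be illustrated with a small tolerant decoder; as_text is a hypothetical helper for illustration only and is not part of Keras or h5py.

```python
def as_text(value, encoding="utf8"):
    """Return `value` as str, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(as_text(b"2.1.5"))  # bytes, the form old Keras code expects
print(as_text("2.1.5"))   # str, the form that triggers the AttributeError
```

Downgrading h5py leaves the installed Keras files untouched, which is usually the less invasive choice for an old Keras release.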


6. Package index mirrors

  • The Douban mirror is recommended.
Alibaba Cloud http://mirrors.aliyun.com/pypi/simple/

University of Science and Technology of China (USTC) https://pypi.mirrors.ustc.edu.cn/simple/

Douban http://pypi.douban.com/simple/

Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/

7. Error when computing on multiple GPUs

AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
  • The TensorFlow version may be at fault: downgrading TensorFlow from 1.14.0 to 1.13.1 and changing Keras to 2.2.4 resolved the problem.

8. Installing pytorch==1.2.0

# CUDA 10.0
pip install torch==1.2.0 torchvision==0.4.0

9. Installing opencv==3.4.3

  • Installing OpenCV 3 (the opencv-python package is distributed on PyPI, so install it with pip):
pip install opencv-python==3.4.0.12

10. h5py

  • Error:
    original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'
Uninstall the existing h5py module and install version 2.10.0:
pip install h5py==2.10.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
