Note: file names are case-sensitive on Linux but not on Windows, so if a data file has the extension .JPG, trying to read it as .jpg fails on Linux.
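A minimal sketch of one way to sidestep the problem, assuming the images sit in a single directory: compare lower-cased extensions instead of hard-coding `.jpg`. The name `list_images` and its parameters are illustrative and not part of the original project.

```python
from pathlib import Path


def list_images(directory, extensions=(".jpg", ".jpeg")):
    """Return files whose extension matches, ignoring case (hypothetical helper)."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )
```

With this helper, a.JPG and b.jpg in the same folder are both picked up, on Linux as well as on Windows.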
1. Summary of installation problems
[anaconda] Creating, listing, and deleting conda virtual environments (anaconda command reference)
CUDA 10.0 download page
cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.0
Using tensorflow-gpu from Keras
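import os
import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

# Choose the visible GPUs before a session is created.
# "0" exposes only GPU 0; use "0,1" to expose GPUs 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Cap this process at 90% of the memory of each visible GPU
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
session = tf.Session(config=config)
# Register the session with Keras
KTF.set_session(session)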
After setting up CUDA 10.0 + Keras 2.1.5 + tensorflow-gpu 1.14, training failed again with the following error:
2021-03-03 09:40:02.989207: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2021-03-03 09:40:02.989419: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at constant_op.cc:172 : Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node decoder_conv0_depthwise_BN/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[loss/mul/_3631]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node decoder_conv0_depthwise_BN/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/pycharm/Semantic-Segmentation-master/deeplab_Mobile/train.py", line 145, in <module>
callbacks=[checkpoint_period, reduce_lr, early_stopping])
File "D:\ananconda\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "D:\ananconda\lib\site-packages\keras\engine\training.py", line 2224, in fit_generator
class_weight=class_weight)
File "D:\ananconda\lib\site-packages\keras\engine\training.py", line 1883, in train_on_batch
outputs = self.train_function(ins)
File "D:\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 2478, in __call__
**self.session_kwargs)
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
run_metadata)
File "D:\ananconda\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node decoder_conv0_depthwise_BN/FusedBatchNorm (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:1802) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[loss/mul/_3631]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[8,304,104,104] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node decoder_conv0_depthwise_BN/FusedBatchNorm (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:1802) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node decoder_conv0_depthwise_BN/FusedBatchNorm:
decoder_conv0_depthwise/depthwise (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:3522)
decoder_conv0_depthwise_BN/beta/read (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:395)
Input Source operations connected to node decoder_conv0_depthwise_BN/FusedBatchNorm:
decoder_conv0_depthwise/depthwise (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:3522)
decoder_conv0_depthwise_BN/beta/read (defined at \ananconda\lib\site-packages\keras\backend\tensorflow_backend.py:395)
Original stack trace for 'decoder_conv0_depthwise_BN/FusedBatchNorm':
File "/pycharm/Semantic-Segmentation-master/deeplab_Mobile/train.py", line 87, in <module>
model = Deeplabv3(classes=NCLASSES,input_shape=(HEIGHT,WIDTH,3))
File "\pycharm\Semantic-Segmentation-master\deeplab_Mobile\nets\deeplab.py", line 124, in Deeplabv3
depth_activation=True, epsilon=1e-5)
File "\pycharm\Semantic-Segmentation-master\deeplab_Mobile\nets\deeplab.py", line 47, in SepConv_BN
x = BatchNormalization(name=prefix + '_depthwise_BN', epsilon=epsilon)(x)
File "\ananconda\lib\site-packages\keras\engine\topology.py", line 619, in __call__
output = self.call(inputs, **kwargs)
File "\ananconda\lib\site-packages\keras\layers\normalization.py", line 181, in call
epsilon=self.epsilon)
File "\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 1827, in normalize_batch_in_training
epsilon=epsilon)
File "\ananconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 1802, in _fused_normalize_batch_in_training
data_format=tf_data_format)
File "\ananconda\lib\site-packages\tensorflow\python\ops\nn_impl.py", line 1329, in fused_batch_norm
name=name)
File "\ananconda\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4301, in _fused_batch_norm
name=name)
File "\ananconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "\ananconda\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "\ananconda\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
op_def=op_def)
File "\ananconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
- The error was caused by insufficient GPU memory. Reducing batch_size from 8 to 4 fixed it, and training ran normally.
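The size of the tensor named in the log shows why halving the batch helps. The sketch below is plain arithmetic, assuming 4 bytes per float32 element; `tensor_bytes` is an illustrative name.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor; float32 uses 4 bytes per element."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape taken from the OOM message: [batch, 304, 104, 104]
for batch_size in (8, 4):
    size = tensor_bytes((batch_size, 304, 104, 104))
    print(f"batch_size={batch_size}: {size / 2**20:.1f} MiB")
```

This one activation needs about 100 MiB at batch size 8 and about 50 MiB at batch size 4. Training keeps many such tensors alive at once, so total memory use grows with the batch size.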
2. Logging to TensorBoard from Keras code
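- Add TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True) to the callbacks of model.fit_generator(), then run tensorboard --logdir=./logs --host=127.0.0.1 --port=6006 in a console to open the visualization. Note that write_grads only has an effect when histogram_freq is greater than 0.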
import tensorflow as tf
from nets.deeplab import Deeplabv3
from keras.utils.data_utils import get_file
from keras.optimizers import Adam
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from PIL import Image
import time
import keras
from keras import backend as K
import numpy as np
import keras.backend.tensorflow_backend as KTF
import os
# Start training. model, lines, num_train, num_val, batch_size, log_dir and the
# checkpoint_period / reduce_lr / early_stopping callbacks are assumed to be defined earlier in train.py.
model.fit_generator(generate_arrays_from_file(lines[:num_train], batch_size),
                    steps_per_epoch=max(1, num_train//batch_size),
                    validation_data=generate_arrays_from_file(lines[num_train:], batch_size),
                    validation_steps=max(1, num_val//batch_size),
                    epochs=100,
                    initial_epoch=0,
                    # Variant with early stopping:
                    # callbacks=[checkpoint_period, reduce_lr, early_stopping,
                    #            TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True)])
                    callbacks=[checkpoint_period, reduce_lr,
                               TensorBoard(log_dir="./logs", histogram_freq=0, batch_size=4, write_grads=True)])
model.save_weights(log_dir+'last1.h5')
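The step counts passed to fit_generator follow directly from the split sizes. A small worked sketch, assuming a 90/10 train/validation split and 1000 annotation lines; both numbers are hypothetical, because the snippet above does not show how num_train and num_val are computed.

```python
def split_and_steps(num_lines, batch_size, val_fraction=0.1):
    """Return (num_train, num_val, steps_per_epoch, validation_steps)."""
    num_val = int(num_lines * val_fraction)
    num_train = num_lines - num_val
    # max(1, ...) keeps at least one step even when a split is smaller than one batch
    steps_per_epoch = max(1, num_train // batch_size)
    validation_steps = max(1, num_val // batch_size)
    return num_train, num_val, steps_per_epoch, validation_steps


print(split_and_steps(1000, 4))
```

Halving batch_size doubles the number of steps per epoch, so an epoch still covers roughly the same number of samples.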
3. cuDNN environment variable configuration
ImportError: Could not find 'cudart64_100.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 10.0 from this URL: https://developer.nvidia.com/cuda-90-download-archive
- After configuring the cuDNN environment variables, restart PyCharm or PowerShell so that the new settings take effect. This resolves the error.
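To check whether the DLL is reachable from the current process, one option is to scan the directories on PATH. A minimal sketch using only the standard library; `find_on_path` is an illustrative helper, not a TensorFlow or CUDA API.

```python
import os


def find_on_path(filename, path_value=None):
    """Return every PATH directory that contains `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


print(find_on_path("cudart64_100.dll"))
```

An empty list means no PATH directory visible to this process contains the file. A terminal or IDE that was opened before the variable was changed still sees the old value, which is why a restart helps.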
4. TensorBoard error
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
d:\anaconda\anaconda\lib\site-packages\tensorflow\python\framework\dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
File "d:\anaconda\anaconda\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "d:\anaconda\anaconda\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\Anaconda\anaconda\Scripts\tensorboard.exe\__main__.py", line 4, in <module>
File "d:\anaconda\anaconda\lib\site-packages\tensorboard\main.py", line 40, in <module>
from tensorboard import default
File "d:\anaconda\anaconda\lib\site-packages\tensorboard\default.py", line 39, in <module>
from tensorboard.plugins.beholder import beholder_plugin_loader
File "d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\__init__.py", line 22, in <module>
from tensorboard.plugins.beholder.beholder import Beholder
File "d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\beholder.py", line 199, in <module>
class BeholderHook(tf.estimator.SessionRunHook):
File "d:\anaconda\anaconda\lib\site-packages\tensorflow\python\util\deprecation_wrapper.py", line 106, in __getattr__
attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow' has no attribute 'estimator'
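- Workaround: in d:\anaconda\anaconda\lib\site-packages\tensorboard\plugins\beholder\beholder.py, delete the contents of the parentheses in class BeholderHook(tf.estimator.SessionRunHook): so that the line reads class BeholderHook():. The error then disappears. TF2 removed Session, so the hook may not be needed there anyway.
- Be sure to use Python 3.6; otherwise TensorBoard raises errors.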
4. Linux environment configuration errors
2021-03-13 17:53:43.280913: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281034: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281145: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281253: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281359: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281463: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281569: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2021-03-13 17:53:43.281592: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2021-03-13 17:53:43.281658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-13 17:53:43.281681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2021-03-13 17:53:43.281695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N N
2021-03-13 17:53:43.281706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: N N
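- All of the libraries above are in /usr/local/cuda-10.0/lib64/; the figure below lists the contents of that directory. The files were in place the whole time, yet TensorFlow could not find them.
- With this error the program still runs, but extremely slowly, because it uses only the CPU. Before the fix, I had uninstalled tensorflow-gpu==1.14.0 and installed tensorflow-gpu==1.13.1. That step very likely changed some TensorFlow configuration, which let it read the files under /usr/local/cuda-10.0/lib64/ again.
- Here is everything I tried that I think may have helped. First I uninstalled miniconda3, installed anaconda, and reinstalled tensorflow-gpu==1.14.0 and keras==2.1.5. Running the program showed that this made no difference.
- Then I reopened ~/.bashrc and added the lib64 path again: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64/. The full ~/.bashrc configuration is as follows: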
export PATH="/home/sunqilin/anaconda3/bin:$PATH"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64/
export PATH=$PATH:/usr/local/cuda-10.0/bin
# CUDA_HOME is a single directory, not a colon-separated list
export CUDA_HOME=/usr/local/cuda-10.0
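The library names can be pulled straight from the log and compared with the directories on LD_LIBRARY_PATH. A sketch under the assumption that the log lines have the `Could not dlopen library '...'` form shown above; both function names are illustrative.

```python
import os
import re

DLOPEN_PATTERN = re.compile(r"Could not dlopen library '([^']+)'")


def libraries_from_log(log_text):
    """Extract the library names TensorFlow failed to load, in log order."""
    return DLOPEN_PATTERN.findall(log_text)


def missing_from(directories, library_names):
    """Return the libraries that exist in none of the given directories."""
    return [
        name for name in library_names
        if not any(os.path.isfile(os.path.join(d, name)) for d in directories if d)
    ]


if __name__ == "__main__":
    # Shortened sample line; paste the real log text here instead
    log = "dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror"
    search_dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    print(missing_from(search_dirs, libraries_from_log(log)))
```

If the printed list is empty but TensorFlow still logs the error, the running process probably did not inherit the updated variable; running source ~/.bashrc or opening a new shell is worth trying first.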
5. Error: AttributeError: 'str' object has no attribute 'decode'
https://github.com/fchollet/deep-learning-models/releases/download/v0.6/mobilenet_1_0_224_tf_no_top.h5
Traceback (most recent call last):
File "/home/user-zhm/sql/Unet_attention/train.py", line 80, in <module>
model.load_weights(weights_path, by_name=True, skip_mismatch=True)
File "/home/user-zhm/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 2653, in load_weights
reshape=reshape)
File "/home/user-zhm/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 3407, in load_weights_from_hdf5_group_by_name
original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'
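- Fix: uninstall the existing h5py module and install version 2.10. h5py 3.0 and later return string attributes as str rather than bytes, so the .decode('utf8') call in this Keras version fails.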
pip install h5py==2.10 -i https://pypi.tuna.tsinghua.edu.cn/simple/
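The traceback comes from calling .decode on a value that is already a str. A minimal sketch of the difference, assuming that the installed h5py returns the attribute as str where 2.10 returns bytes; `as_text` is an illustrative helper, not Keras code.

```python
def as_text(value, encoding="utf8"):
    """Return `value` as str whether the HDF5 attribute arrived as bytes or str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(as_text(b"2.1.5"))  # bytes input, the case a bare .decode() call expects
print(as_text("2.1.5"))   # str input, the case that breaks a bare .decode() call
```

Pinning h5py to 2.10 achieves the same result without editing the installed Keras sources.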
6. Package index mirrors
- The Douban mirror is recommended.
Alibaba Cloud http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China (USTC) https://pypi.mirrors.ustc.edu.cn/simple/
Douban http://pypi.douban.com/simple/
Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/
7. Multi-GPU training error
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
- The TensorFlow version appeared to be the problem. Downgrading TensorFlow from 1.14.0 to 1.13.1 and switching Keras to 2.2.4 resolved it.
8. Installing pytorch==1.2.0
# CUDA 10.0
pip install torch==1.2.0 torchvision==0.4.0
9. Installing opencv==3.4.3
- Installing OpenCV 3
pip install opencv-python==3.4.0.12