GeForce RTX 3090: pitfalls from running an open-source TensorFlow ASR project (TensorFlow and Keras versions)

Background

After buying a new GPU, a GeForce RTX 3090, I couldn't wait to try it out.

Trying a project

I decided to try a Chinese ASR project:

https://github.com/nl8590687/ASRT_SpeechRecognition

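Naively, I assumed this would be quick and easy. Following the project's description, I installed tensorflow 1.13; then, following advice from some Chinese sites and TensorFlow's own messages, I installed cuda10 and cudnn7.6. That is where the nightmare began. I had set off in the wrong direction, and frantically patching over each error only took me further off course. Various DLL files turned out to be missing, so I started hunting for them online.

For example, files such as cudart64_100.dll were missing. I even found the resource below, downloaded the files, set the environment variables, and thought everything was fine. Link: https://download.mersenne.ca/CUDA-DLLs/CUDA-10.0

Configuring files and downloading cudnn and cuda took almost a whole day. Then the program ran for an afternoon at a snail's pace. I opened Task Manager: GPU usage was at 5%. That hurt.

In the end I used cuda_11.1 and cudnn-v8.0430, together with a newer version of tensorflow, and modified the keras source code to get it running on the GPU. The specific pitfalls follow.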

Errors encountered

a. NumPy version error on Windows

 fails to pass a sanity check due to a bug in the windows runtime. See this issue for more informati

Fix: pip install numpy==1.19.3 -i https://pypi.tuna.tsinghua.edu.cn/simple

b. Missing DLL files

ImportError: Could not find 'cudart64_100.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 10.0 from this URL: CUDA Toolkit 9.0 Downloads

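Attempted fix:

I downloaded cuda 10 and tensorflow 1.14.1, but the problem remained. Mismatched versions are a killer. Because the 3090 is so new, I followed various people's suggestions, looked up the matching tensorflow versions and the "CAX2" (presumably AVX2) cannot-be-used issue, and installed 2.3.0 and other versions. Still no luck.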
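Most of the version juggling above comes down to matching each tensorflow release with the CUDA and cuDNN versions it was built against. A minimal sketch of that lookup follows. The values are reproduced from memory of the "tested build configurations" table in TensorFlow's install documentation, so verify them against the official table before relying on them; the helper name `required_cuda` is hypothetical.

```python
# Tested GPU build configurations (CUDA, cuDNN) per tensorflow release.
# Assumption: values recalled from TensorFlow's install docs; double-check them.
TESTED_CONFIGS = {
    "1.13.1": ("10.0", "7.4"),
    "1.14.0": ("10.0", "7.4"),
    "2.3.0": ("10.1", "7.6"),
    "2.4.0": ("11.0", "8.0"),
}


def required_cuda(tf_version):
    """Return (cuda, cudnn) for a tensorflow version, or None if it is not listed."""
    return TESTED_CONFIGS.get(tf_version)


if __name__ == "__main__":
    for version in ("1.13.1", "2.3.0", "2.4.0"):
        print(version, "->", required_cuda(version))
```

If the table is right, none of the releases I tried at this point were built against CUDA 11, which would explain why swapping between them never helped on the 3090.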

c. Testing the GPU

import tensorflow as tf
tf.test.is_gpu_available()

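It still returned False.

Only after I downloaded the latest cuda_11.1 and cudnn-v8.0430 and added the extracted cudnn files to the PATH environment variable did I finally see a glimmer of hope, although it still complained that the CPU was not supported.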
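Since so much of this came down to whether a DLL could be found through PATH, a small check like the following would have saved me time. It is only a sketch: `find_on_path` is a hypothetical helper, and apart from cudart64_100.dll (taken from the error message) the DLL names are assumptions that may differ between CUDA and cuDNN releases.

```python
import os


def find_on_path(filename, path=None):
    """Return every directory on PATH that contains `filename`."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # cudart64_100.dll comes from the ImportError; the other two names are
    # assumed examples for CUDA 11 and cuDNN 8 and may not match your install.
    for dll in ("cudart64_100.dll", "cudart64_110.dll", "cudnn64_8.dll"):
        print(dll, "->", find_on_path(dll) or "not found")
```

More than one hit for the same DLL is also worth noticing, because it usually means leftovers from an older install are still on PATH.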

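The 3090 needs cuda 11, so I went back and uninstalled cuda 10.

Note: when uninstalling, remove every CUDA component whose version is 10 [Control Panel -- Programs -- Uninstall a program].

d. Installing tf-nightly-gpu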

import tensorflow as tf
tf.test.is_gpu_available()

This time it returned True.
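As far as I can tell, `tf.test.is_gpu_available()` is marked deprecated in TensorFlow 2.x in favor of `tf.config.list_physical_devices('GPU')`. A minimal sketch of the newer check, wrapped in a hypothetical helper `list_gpus` that also copes with TensorFlow not being importable:

```python
def list_gpus():
    """Return the names of GPUs visible to TensorFlow, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus is None:
        print("TensorFlow could not be imported")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU visible; check the CUDA/cuDNN install and PATH")
```

An empty list here corresponds to the False I kept getting in step c.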

e. Running python train_mspeech.py crashed immediately

 File "D:ASR_projectasrSpeechModel251.py", line 44, in __init__
    self._model, self.base_model = self.CreateModel()
  File "D:ASR_projectasrSpeechModel251.py", line 73, in CreateModel
    layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
  File "D:ASR_projectasrvenvlibsite-packageskerasbackendtensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "D:ASR_projectasrvenvlibsite-packageskerasenginebase_layer.py", line 446, in __call__
    self.assert_input_compatibility(inputs)
  File "D:ASR_projectasrvenvlibsite-packageskerasenginebase_layer.py", line 310, in assert_input_compatibility
    K.is_keras_tensor(x)
  File "D:ASR_projectasrvenvlibsite-packageskerasbackendtensorflow_backend.py", line 695, in is_keras_tensor
    if not is_tensor(x):
  File "D:ASR_projectasrvenvlibsite-packageskerasbackendtensorflow_backend.py", line 703, in is_tensor
    return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'

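Is this the legendary incompatibility between tensorflow versions?

I tried updating the keras version and nearly broke down. The only option left was to edit the source by hand, which was torture, so I went looking for a solution in the tensorflow issues on GitHub: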

NMazzatenta commented on 27 Apr:
I had the same issue. TF 2.1 built from source + keras 2.3.1 in conda environment. Solved by modifying file "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py" at line 704.
Before:
return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
After:
return isinstance(x, tf_ops._TENSOR_LIKE_TYPES) or tf_ops.is_dense_tensor_like(x)

Don't know if it is the right thing to do, but I got things running in this way.


It still failed. Opening ops.py showed that it has no _TensorLike attribute. Modifying the source as shown in the screenshot fixed it.

[Screenshot: the modified source]

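That was finally solved, but by now I was calm about it: I knew there would be more problems in the code, and sure enough I was not disappointed.

f. The next error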

WARNING:tensorflow:From train_mspeech.py:23: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

Traceback (most recent call last):
  File "D:ASR_projectasrvenvlibsite-packageskerasenginebase_layer.py", line 310, in assert_input_compatibility
    K.is_keras_tensor(x)
  File "D:ASR_projectasrvenvlibsite-packageskerasbackendtensorflow_backend.py", line 697, in is_keras_tensor
    str(type(x)) + '`. '
ValueError: Unexpectedly found an instance of type `<class 'tensorflow.python.keras.engine.keras_tensor.KerasTensor'>`. Expected a symbolic tensor instance.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_mspeech.py", line 46, in <module>
    ms = ModelSpeech(datapath)
  File "D:ASR_projectasrSpeechModel251.py", line 44, in __init__
    self._model, self.base_model = self.CreateModel()
  File "D:ASR_projectasrSpeechModel251.py", line 73, in CreateModel
    layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
  File "D:ASR_projectasrvenvlibsite-packageskerasbackendtensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "D:ASR_projectasrvenvlibsite-packageskerasenginebase_layer.py", line 446, in __call__
    self.assert_input_compatibility(inputs)
  File "D:ASR_projectasrvenvlibsite-packageskerasenginebase_layer.py", line 316, in assert_input_compatibility
    str(inputs) + '. All inputs to the layer '
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonkerasenginekeras_tensor.py", line 332, in __repr__
    layer = self._keras_history.layer
AttributeError: 'tuple' object has no attribute 'layer'

The fix is to switch to the keras that ships with tensorflow. OK, time for a find-and-replace across the whole project.

Reference: "AttributeError: 'tuple' object has no attribute 'layer'" (blog.csdn.net)
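The find-and-replace I did by hand can be sketched as a small script. This is an illustration under assumptions, not the exact edit I made: the helper name `use_bundled_keras` is hypothetical, and it only handles `from keras ... import ...` and a bare `import keras`. Other forms, such as `import keras as kr`, still need manual attention.

```python
import re

# "from keras import x" / "from keras.layers import y" at the start of a line.
# The lookahead keeps names such as keras_preprocessing untouched.
_FROM_KERAS = re.compile(r"^([ \t]*)from[ \t]+keras(?=[.\s])", re.MULTILINE)
# A bare "import keras" on its own line.
_IMPORT_KERAS = re.compile(r"^([ \t]*)import[ \t]+keras[ \t]*$", re.MULTILINE)


def use_bundled_keras(source):
    """Rewrite standalone-keras imports to the keras bundled with tensorflow."""
    source = _FROM_KERAS.sub(r"\1from tensorflow.keras", source)
    return _IMPORT_KERAS.sub(r"\1from tensorflow import keras", source)


if __name__ == "__main__":
    sample = "from keras.layers import Conv2D\nfrom keras import backend as K\n"
    print(use_bundled_keras(sample))
```

The point of the change is that the layers and the tensors they receive then come from the same package, instead of standalone keras layers being fed tensorflow.keras tensors as in the traceback above.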

g. Running the project then produced an out-of-memory error. With a newly bought 3090 that seemed unlikely, so I adjusted the GPU options.

config.gpu_options.per_process_gpu_memory_fraction = 0.95
# config.gpu_options.allow_growth=True  # don't grab all GPU memory up front; allocate on demand?

I also reduced batch_size to 64.
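The option lines above assume a `config` object already exists. A minimal sketch of how such a config can be built and handed to keras through the TF1-style compatibility API; the helper name `build_gpu_config` and its default values are my own choices for illustration:

```python
def build_gpu_config(allow_growth=True, memory_fraction=None):
    """Build a TF1-style ConfigProto with explicit GPU memory options."""
    import tensorflow as tf  # imported lazily so this file loads without TF

    config = tf.compat.v1.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all at startup.
    config.gpu_options.allow_growth = allow_growth
    if memory_fraction is not None:
        # Upper bound on the share of GPU memory this process may use.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


if __name__ == "__main__":
    import tensorflow as tf

    session = tf.compat.v1.Session(config=build_gpu_config(memory_fraction=0.95))
    tf.compat.v1.keras.backend.set_session(session)
```

Neither option shrinks what the model itself needs, so a smaller batch_size is still the main lever when the out-of-memory error comes from the batch rather than from how memory is reserved.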

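h. Sweet: training took off at full speed

Seeing the GPU actually being used made me very happy. Training speed finally shot up, which was exciting compared with the CPU-bound start.

Then the first epoch finally finished, and it fell over again.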

alhost/replica:0/task:0/device:GPU:0 with 23347 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
2020-11-10 09:56:23.572753: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Traceback (most recent call last):
  File "train_mspeech.py", line 49, in <module>
    ms.TrainModel(datapath, epoch = 50, batch_size = 32, save_step = 500)
  File "D:ASR_projectasrSpeechModel251.py", line 187, in TrainModel
    self.TestModel(self.datapath, str_dataset='train', data_count = 4)
  File "D:ASR_projectasrSpeechModel251.py", line 250, in TestModel
    pre = self.Predict(data_input, data_input.shape[0] // 8)
  File "D:ASR_projectasrSpeechModel251.py", line 326, in Predict
    r1 = r[0][0].eval(session=tf.compat.v1.Session())
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonframeworkops.py", line 1258, in eval
    "eval is not supported when eager execution is enabled, "
NotImplementedError: eval is not supported when eager execution is enabled, is .numpy() what you're looking for?

In big letters: "what you're looking for?" That stings!!!

I kept searching; reportedly adding this line takes care of it:

tf.compat.v1.disable_eager_execution()
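As I understand the documentation, this call belongs at program startup, before any tensors or ops are created. A minimal sketch of graph-mode evaluation after disabling eager execution; the function name `eval_in_graph_mode` is hypothetical, and the `with` block is my own addition so that the session is closed when it is no longer needed:

```python
def eval_in_graph_mode():
    """Evaluate a small tensor through a session instead of eager .numpy()."""
    import tensorflow as tf

    # Switch to graph mode; intended to run before any op is created.
    tf.compat.v1.disable_eager_execution()

    x = tf.constant([1, 2, 3])
    doubled = x * 2
    # The session is closed automatically when the with-block exits.
    with tf.compat.v1.Session() as sess:
        return sess.run(doubled).tolist()


if __name__ == "__main__":
    print(eval_in_graph_mode())  # prints [2, 4, 6]
```

The alternative hinted at by the error message is to leave eager execution on and call `.numpy()` on the tensor, which would mean changing the project's Predict code instead.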

It ran. I was about to witness a miracle, and then: out of memory. The machine froze. Froze!!

... I nearly lost it ...

Back to the pitfalls. I lowered batch_size a bit further, to 16,

and added config.gpu_options.allow_growth=True

Update, 2020-11-16:

Each time the model has been running for about 8 hours, the following error appears:
Traceback (most recent call last):
  File "train_mspeech.py", line 53, in <module>
    ms.TrainModel(datapath, epoch = 50, batch_size = 16, save_step = 500)
  File "D:ASR_projectasrSpeechModel251.py", line 187, in TrainModel
    self.TestModel(self.datapath, str_dataset='train', data_count = 4)
  File "D:ASR_projectasrSpeechModel251.py", line 250, in TestModel
    pre = self.Predict(data_input, data_input.shape[0] // 8)
  File "D:ASR_projectasrSpeechModel251.py", line 326, in Predict
    r1 = r[0][0].eval(session=tf.compat.v1.Session())
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonframeworkops.py", line 921, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonframeworkops.py", line 5515, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonclientsession.py", line 968, in run
    run_metadata_ptr)
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonclientsession.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonclientsession.py", line 1369, in _do_run
    run_metadata)
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonclientsession.py", line 1375, in _do_call
    return fn(*args)
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonclientsession.py", line 1358, in _run_fn
    self._extend_graph()
  File "D:ASR_projectasrvenvlibsite-packagestensorflowpythonclientsession.py", line 1398, in _extend_graph
    tf_session.ExtendSession(self._session)
MemoryError: bad allocation

# 2020-11-16: trying to limit the memory leak; after searching online I made the following change
config.inter_op_parallelism_threads=1
config.intra_op_parallelism_threads=1

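That did not solve the problem, however. I am still following up. I asked the project's author on GitHub; they have not seen this issue, so it may be caused by the tf version. Still investigating.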
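One hypothesis I have not confirmed: the traceback ends in `_extend_graph`, the step where a session picks up operations that were added to the graph since it last ran, and Predict opens a new `tf.compat.v1.Session()` on every call without closing it. If each call also adds new operations to the default graph, memory would grow steadily over hours. The sketch below only demonstrates the growth mechanism in isolation; `ops_after_each_call` is a hypothetical helper and `tf.reduce_sum` merely stands in for whatever Predict builds.

```python
def ops_after_each_call(n_calls=3):
    """Build the same op repeatedly and record the graph size after each call."""
    import tensorflow as tf

    sizes = []
    with tf.Graph().as_default() as graph:
        x = tf.constant([1.0, 2.0, 3.0])
        for _ in range(n_calls):
            tf.reduce_sum(x)  # stand-in for an op created inside Predict
            sizes.append(len(graph.get_operations()))
    return sizes


if __name__ == "__main__":
    # The counts keep increasing: every call adds new nodes to the graph.
    print(ops_after_each_call())
```

One way to test the hypothesis on the real project would be to call `finalize()` on the default graph once the model is built; TensorFlow then raises an error as soon as anything tries to add an operation, which would point straight at the offending line.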
