Last night the CPU build only got through 3,000 training steps, so I had no choice but to uninstall the CPU version of TensorFlow; I didn't expect the switch to eat up an entire day.
References: https://www.cnblogs.com/fanfzj/p/8521728.html
https://blog.csdn.net/liangyihuai/article/details/78688228
pip uninstall tensorflow went smoothly. I then downloaded CUDA (v9.0, https://developer.nvidia.com/cuda-toolkit-archive) and cuDNN (7.0, https://developer.nvidia.com/rdp/cudnn-archive, registration on the NVIDIA site required), set PATH, copied the cuDNN files into the CUDA directories, and installed the GPU build of TensorFlow. Pay attention to version compatibility: 1.4.0 does not match this CUDA/cuDNN combination (learned the hard way; version table at https://blog.csdn.net/yeler082/article/details/80943040). So I uninstalled once more and installed 1.9.0: pip install --upgrade https://storage.googleapis.com/tensorflow/windows/gpu/tensorflow_gpu-1.9.0-cp35-cp35m-win_amd64.whl
Test that the installation succeeded:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
Check whether GPU acceleration is available:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
Obstacles hit while training on my own dataset:
1. OOM ("Hint: If you want to see a list of allocated tensors when OOM happens"). OOM can strike several epochs into training: if the model sits at the borderline of using all GPU memory, internal allocation issues such as memory fragmentation or how temporary buffers are used can push it over the limit even though a few epochs trained fine. The only remedy is to reduce memory usage, i.e. shrink the batch size or use a smaller model and try again.
Fix: GPU memory usage is too high; reduce batch_size or the number of units in the hidden layers (how to check GPU utilization: https://blog.csdn.net/weixin_41770169/article/details/80349088).
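Why shrinking batch_size is the quickest fix can be sketched with a back-of-envelope estimate (a toy calculation for a single hypothetical dense layer; the layer sizes below are made up, not from my model):

```python
def estimate_memory_mb(batch_size, input_dim, hidden_units, bytes_per_float=4):
    """Very rough float32 memory estimate for one dense layer:
    parameters plus forward activations."""
    params = input_dim * hidden_units + hidden_units        # weights + biases
    activations = batch_size * (input_dim + hidden_units)   # per-example tensors
    return (params + activations) * bytes_per_float / 1024 ** 2

# Parameter memory is fixed, but activation memory scales linearly
# with batch_size, so halving the batch frees a large chunk of it.
print(estimate_memory_mb(128, 784, 1024))
print(estimate_memory_mb(64, 784, 1024))
```

Optimizer state and backprop buffers add more on top, so the real footprint is larger, but the scaling argument is the same.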
2. AttributeError: module 'tensorflow' has no attribute 'constant'
Fix: a version conflict, or the installation did not actually succeed; reinstall TensorFlow.
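Before reinstalling, it is worth ruling out that the active interpreter simply cannot find the package (e.g. the wrong virtualenv is active). A small stdlib-only sketch that checks this without importing TensorFlow itself:

```python
import importlib.util

def module_available(name):
    """Return True if the active interpreter can locate a module,
    without actually importing it."""
    return importlib.util.find_spec(name) is not None

# If this prints False, tensorflow is not installed in the current
# environment, which explains AttributeError on tf.constant.
print(module_available("tensorflow"))
```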
3. AttributeError: module 'tensorflow' has no attribute 'init_scope'
Fix: after installing a newer TensorFlow, some function names in the library changed. In my case a previously trained model referenced the old function, which triggered the error; deleting that model and retraining from scratch resolved it.
4. 'Nan in summary histogram for' appears after training for a while
Fix: the initial weights are too large; lowering the learning rate from the start resolves it.
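Why a smaller learning rate helps can be seen in a toy gradient-descent run (pure NumPy with made-up numbers, nothing TensorFlow-specific): with a large initial weight and a large step size the updates overshoot, the weight blows up, and NaN appears.

```python
import numpy as np

def final_weight(lr, w0=5.0, steps=60):
    """Gradient descent on f(w) = w**4, starting from a large weight."""
    w = np.float64(w0)
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            grad = 4.0 * w ** 3   # df/dw
            w = w - lr * grad
    return w

# A small step size converges; a large one diverges to inf and then NaN,
# which is exactly what the summary histogram then chokes on.
print(final_weight(0.001))
print(final_weight(0.5))
```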
5. Nan in summary histogram
Fix: https://blog.csdn.net/v1_vivian/article/details/77991894 makes a convincing case, but unfortunately my error turned out to be a different one.
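For reference, one classic source of NaN in loss and summary values is taking log(0) inside a cross-entropy; a minimal NumPy illustration of the problem and the usual clipping fix (my own toy example, not taken from that post):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=None):
    """Cross-entropy -sum(y * log(p)); optionally clip p away from 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        if eps is not None:
            y_pred = np.clip(y_pred, eps, 1.0)
        return -np.sum(y_true * np.log(y_pred))

# A prediction of exactly 0 yields 0 * log(0) = NaN; clipping avoids it.
print(cross_entropy(np.array([0., 1.]), np.array([0., 1.])))
print(cross_entropy(np.array([0., 1.]), np.array([0., 1.]), eps=1e-10))
```

In TF 1.x the equivalent guard is clipping the softmax output (or using the fused softmax-cross-entropy op) before the log.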
Other troubleshooting notes: https://blog.csdn.net/weixin_34004750/article/details/87479297
Still unresolved: issue 5 above.