TensorFlow 2.2 GPU: Installation, Configuration, and Troubleshooting Summary

Latest recommended article published on 2024-06-25 04:19:29

bt-harp

Tags: tensorflow

Article link: https://blog.csdn.net/weixin_43158831/article/details/111636685

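Installation overview: CUDA 10.1, cuDNN 7.6.4, TensorFlow 2.2

1. CUDA Installation

Most current deep learning frameworks rely on NVIDIA GPUs for accelerated computation, so you need to install CUDA, the GPU acceleration library provided by NVIDIA. Before installing CUDA, confirm that your computer has an NVIDIA GPU that supports it. If it does not (for example, machines whose graphics card is made by AMD, and some MacBook laptops), CUDA cannot be installed; skip this step and go straight to the TensorFlow installation. Installing CUDA involves three steps: installing the CUDA software, installing the cuDNN deep neural network acceleration library, and configuring the environment variables. The process is somewhat tedious, so think about why each step is needed rather than memorizing the procedure.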

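Open the CUDA download site: https://developer.nvidia.com/cuda-10.0-download-archive. Note that this link points to the 10.0 archive page; since we use CUDA 10.1 here, pick the 10.1 release from NVIDIA's CUDA Toolkit archive instead. Select, in order, the Windows platform, the x86_64 architecture, version 10, and the exe (local) installer, then click Download. When the download finishes, run the installer. Choose the "Custom" option and click NEXT to reach the component list, where you select the components to install and deselect the ones you do not need. Under the CUDA node, uncheck "Visual Studio Integration". Under the "Driver components" node, compare the version of the display driver already installed on your machine ("Current Version" of "Display Driver") with the version bundled with CUDA ("New Version"). If "Current Version" is greater than "New Version", uncheck "Display Driver"; if it is less than or equal, leave the default selection. With these settings in place, the installation should complete normally.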

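After installation, test whether CUDA was installed successfully. Open a cmd prompt and type "nvcc -V" to print the current CUDA version information, as shown in Figure 1.29. If the command is not recognized, the installation failed. You can also find "nvcc.exe" under the CUDA installation path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin".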
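The same check can be scripted. The following is a minimal sketch, not part of the original article: the helper names `parse_nvcc_release` and `installed_cuda_release` are made up for illustration, and the parser assumes the `nvcc -V` output contains a phrase of the form "release 10.1".

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Return the CUDA release (e.g. '10.1') found in `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run `nvcc -V` if nvcc is on PATH and return its release string, else None."""
    if shutil.which("nvcc") is None:
        # nvcc not found: CUDA is missing or its bin folder is not on PATH.
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA release:", installed_cuda_release())
```

A result of `None` corresponds to the "command not recognized" case described above.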

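2. Installing the cuDNN Neural Network Acceleration Library

CUDA is not a GPU acceleration library designed specifically for neural networks; it targets any application that needs parallel computation. To accelerate neural network workloads you also need to install the cuDNN library. Note that cuDNN is not an executable program: you only need to download and extract the cuDNN files and configure the Path environment variable. Open https://developer.nvidia.com/cudnn and choose "Download cuDNN". NVIDIA requires you to log in before downloading cuDNN, so log in or create an account first. After logging in, on the cuDNN download page, check "I Agree To the Terms of the cuDNN Software License Agreement" to reveal the list of cuDNN versions. Choose the cuDNN version that matches CUDA 10.1 and click the "cuDNN Library for Windows 10" link to download it. Keep in mind that cuDNN has its own version number and must also correspond to your CUDA version; do not download a cuDNN build made for a different CUDA version. cuDNN 7.6.4 is recommended and has been tested to work.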

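After downloading cuDNN, extract the archive and open the resulting folder. Rename the folder named "cuda" to "cudnn764" and copy it. Go to the CUDA installation path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1 and paste the "cudnn764" folder there. A dialog asking for administrator permission may appear; choose Continue to complete the paste.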
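To double-check which cuDNN version actually ended up in the copied folder, one option is to read the version macros from its header file. This is a sketch under assumptions: it presumes the cuDNN 7.x layout in which `include\cudnn.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR`, and `CUDNN_PATCHLEVEL`; the function name and the example path are illustrative and should be adjusted to your own installation.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return 'major.minor.patch' from the text of cudnn.h, or None if not found."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"#define\s+{name}\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)


if __name__ == "__main__":
    # Assumed location; change it to wherever the cudnn764 folder was pasted.
    header = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit"
                  r"\CUDA\v10.1\cudnn764\include\cudnn.h")
    if header.exists():
        print("cuDNN version:", parse_cudnn_version(header.read_text(errors="ignore")))
    else:
        print("cudnn.h not found at", header)
```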

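Configuring the Path environment variable

Copying the cudnn folder as described above completes the cuDNN installation, but for the system to locate the cuDNN files we also need to configure the Path environment variable. Open File Explorer, right-click "My Computer", choose "Properties", then "Advanced system settings", then "Environment Variables", as shown in Figure 1.32. In the "System variables" list, select the "Path" variable and choose "Edit", as shown in Figure 1.33. Choose "New", enter the cuDNN installation path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\cudnn764\bin", and use the "Move Up" button to move this entry to the top.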

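Once CUDA is installed, the environment variables should contain three entries: "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin", "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\libnvvp", and "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\cudnn764\bin". The exact paths may differ slightly depending on where you installed things. After confirming they are correct, click OK in each dialog to close them all.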
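The three entries can also be verified programmatically. The sketch below is illustrative only: `missing_path_entries` is a hypothetical helper, the comparison ignores case and trailing slashes because Windows paths are case-insensitive, and the base path is an assumption that should be replaced with your actual install location.

```python
import os


def missing_path_entries(required, path_value, sep=os.pathsep):
    """Return the required directories that do not appear in a PATH-style string."""
    def normalize(entry):
        # Ignore case, surrounding spaces, and trailing slashes when comparing.
        return entry.strip().rstrip("\\/").lower()

    present = {normalize(p) for p in path_value.split(sep) if p.strip()}
    return [r for r in required if normalize(r) not in present]


if __name__ == "__main__":
    # Assumed install location; adjust to match your machine.
    base = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1"
    required = [base + r"\bin", base + r"\libnvvp", base + r"\cudnn764\bin"]
    missing = missing_path_entries(required, os.environ.get("PATH", ""))
    print("Missing from Path:", missing if missing else "none")
```

Remember that a cmd window opened before the variables were edited still holds the old Path, so run the check from a new window.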

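3. TensorFlow Installation

Like other Python libraries, TensorFlow can be installed with the pip install command of the Python package manager. Whether your machine has an NVIDIA GPU determines which build to install: the faster GPU build or the slower CPU build. From mainland China, the GPU build can be installed through the Tsinghua mirror with the following command: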

pip install -U tensorflow-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple

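The command above downloads and installs the latest TensorFlow GPU build, which at the time of writing was the TensorFlow 2.3 stable release. The "-U" flag upgrades the package if it is already installed. To install the 2.2 release this article targets, write the package name as tensorflow-gpu==2.2.0; both 2.2 and 2.3 are built against CUDA 10.1 and cuDNN 7.6.

Now test whether the GPU build of TensorFlow was installed successfully. In a cmd prompt, type ipython to enter the IPython interactive shell and run "import tensorflow as tf". If no error occurs, run "tf.test.is_gpu_available()" to test whether the GPU is usable. This prints a series of messages prefixed with "I" (Information), including details of the available GPU devices, and finally returns "True" or "False" to indicate whether a GPU is usable. If it returns True, the TensorFlow GPU build is installed correctly. If it returns False, the installation failed: recheck the CUDA, cuDNN, and environment variable steps, or copy the error message into a search engine. In my experience, False is most often caused by a mismatched cuDNN version. Note that tf.test.is_gpu_available() is deprecated in TensorFlow 2.x in favor of tf.config.list_physical_devices('GPU').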
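As an alternative to the interactive check, here is a small script built on `tf.config.list_physical_devices`, which is available from TensorFlow 2.1 onward. It is a sketch rather than the article's own method; the function name `visible_gpus` is made up for illustration, and the script returns `None` when TensorFlow cannot be imported at all.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow could not be imported.")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU visible; recheck CUDA, cuDNN, and the Path entries.")
```

An empty list plays the same role as the False result described above.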

Summary of Runtime Problems

1. Fail to find the dnn implementation

The error message looks roughly like this:

UnknownError:    Fail to find the dnn implementation.
     [[{{node CudnnRNN}}]]
     [[sequential_18/gru_36/PartitionedCall]] [Op:__inference_train_function_54574]

Function call stack:
train_function -> train_function -> train_function

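This commonly appears when running a recurrent model such as an LSTM with Keras (the trace above involves a GRU layer). After several searches on CSDN, Stack Overflow, and Google, the answer that finally fit was that the cuDNN version was wrong. For example, this answer says: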

For anyone experiencing this issue with TF2.0 and Cuda 10.0 with cuDNN-7, you are likely getting this because you accidentally upgraded cuDNN from 7.6.2 to something >7.6.5. Despite the TF docs stating that anything >=7.4.1 is working, this is not the case! Downgrade to CudNN as follows:
sudo apt-get install --no-install-recommends \
cuda-10-0 \
libcudnn7=7.6.2.24-1+cuda10.0  \
libcudnn7-dev=7.6.2.24-1+cuda10.0

A screenshot of the version compatibility table was attached here (image unavailable).
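Since the screenshot did not survive, the sketch below encodes a few rows of the GPU "tested build configurations" that TensorFlow publishes in its install-from-source documentation. This is not necessarily what the original image showed, and the values should be rechecked against the official table; the dictionary and function names are made up for illustration.

```python
# TensorFlow minor version -> (CUDA version, cuDNN version) from the official
# tested build configurations; verify against the current documentation.
TESTED_GPU_BUILDS = {
    "2.0": ("10.0", "7.4"),
    "2.1": ("10.1", "7.6"),
    "2.2": ("10.1", "7.6"),
    "2.3": ("10.1", "7.6"),
    "2.4": ("11.0", "8.0"),
}


def matches_tested_build(tf_version, cuda_version, cudnn_version):
    """Return True/False if the combination matches the table, None if the TF version is unknown."""
    key = ".".join(tf_version.split(".")[:2])
    expected = TESTED_GPU_BUILDS.get(key)
    if expected is None:
        return None
    cuda_ok = cuda_version.split(".")[:2] == expected[0].split(".")
    cudnn_ok = cudnn_version.split(".")[:2] == expected[1].split(".")
    return cuda_ok and cudnn_ok


if __name__ == "__main__":
    # The combination used in this article.
    print(matches_tested_build("2.2.0", "10.1", "7.6.4"))
```

The comparison only looks at major and minor versions, so it cannot catch patch-level problems like the one described in the quoted answer.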

2. cuDNN failed to initialize

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

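This can also be caused by a mismatched cuDNN version. In that case, find the cuDNN version that matches and repeat step 2 of the installation procedure.

It can also be a GPU memory problem. By default, TensorFlow claims all GPU memory when it runs, which can lead to out-of-memory failures. Common fixes for TF2 include the following: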

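2.1 Worked in my case

import tensorflow as tf

# Enable memory growth on every visible GPU so memory is allocated on demand.
# This must run before any GPU has been initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.set_soft_device_placement(True)    # fall back to another device when an op has no GPU kernel
tf.debugging.set_log_device_placement(True)  # log which device each op runs on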

2.2 An answer from Stack Overflow

The original answer targets TF1 with standalone Keras; under TF2 the same calls are reached through tf.compat.v1:

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
config.log_device_placement = True  # to log device placement (on which device the operation ran)
sess = tf.compat.v1.Session(config=config)
tf.compat.v1.keras.backend.set_session(sess)  # set this TensorFlow session as the default session for Keras

2.3 Another answer from Stack Overflow

import tensorflow as tf

# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
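For code that does not use sessions, a similar cap can be expressed with the TF2 configuration API. The sketch below is an assumption-laden illustration rather than part of the quoted answer: it uses `tf.config.experimental.set_virtual_device_configuration` with a limit in megabytes instead of a fraction, and the helper names are hypothetical.

```python
def fraction_to_megabytes(total_mb, fraction):
    """Convert a memory fraction into a whole number of megabytes."""
    return int(total_mb * fraction)


def limit_first_gpu(memory_limit_mb):
    """Cap TensorFlow's memory use on the first GPU; return True if the cap was applied."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if not gpus:
        return False
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=memory_limit_mb)],
        )
    except RuntimeError as err:
        # Virtual devices must be configured before the GPU has been initialized.
        print(err)
        return False
    return True


if __name__ == "__main__":
    # Roughly one third of a 12 GB card, mirroring the example above.
    limit_first_gpu(fraction_to_megabytes(12 * 1024, 0.333))
```

As with memory growth, the call has to happen at program start, before any operation touches the GPU.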

2.4 Source forgotten

import tensorflow as tf
tf.config.set_soft_device_placement(True)
tf.debugging.set_log_device_placement(True)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)
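Restricting the visible GPUs can also be done from outside TensorFlow with the CUDA_VISIBLE_DEVICES environment variable, which the CUDA runtime reads when it initializes. The sketch below is not from the original post; `restrict_visible_gpus` is a hypothetical helper, and it only has an effect if it runs before TensorFlow is imported.

```python
import os


def restrict_visible_gpus(indices):
    """Set CUDA_VISIBLE_DEVICES to the given GPU indices and return the value written."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only the first GPU to this process.
restrict_visible_gpus([0])

# Import TensorFlow only after the variable has been set, for example:
# import tensorflow as tf
```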

2.5 Source forgotten (2), using the TF1-style API

import tensorflow as tf
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# Variant 1: let the session grow its GPU memory on demand.
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

# Variant 2: cap the process at 30% of GPU memory and use that session for Keras.
config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))