2021-07-12 Setting up a deep learning environment on a GeForce RTX 3090: CUDA 11.1 + TensorFlow 2.5.0 + Python 3.8.3

Julse

Article link: https://blog.csdn.net/Julse/article/details/118686362

The working environment configured in this post has been exported to
https://download.csdn.net/download/Julse/20687132?spm=1001.2014.3001.5501

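Contents

Details of the successful installation
Problem 1: Testing whether TensorFlow is installed correctly
Problem 2: tensorflow vs. tensorflow-gpu
Problem 3: None of the conda channels has tensorflow-gpu=2.5.0, but pip does
Problem 4: TensorFlow is the GPU build; does Keras also need a GPU build?
Problem 5: TensorFlow 2.5 and Keras 2.4.3 may be incompatible
Problem 6: cuDNN error
Installing other CUDA versions
Unresolved questions
Other issues: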

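GeForce RTX 3090
I ran into a lot of problems while setting up this environment. The versions that finally worked are listed below, and I have tested them myself.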

tensorflow-gpu 2.5.0
cudnn 8.1.0.77
python 3.8.3
cuda 11.1
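
To confirm that an environment matches the list above, a small check along these lines may help. It is only a sketch: it covers packages that pip can see (cuDNN and CUDA are not pip packages here, so they are left out), and the helper names installed_version and compare_versions are made up for this example.

```python
from importlib import metadata  # in the standard library since Python 3.8

# pip-visible package version reported as working in this post.
EXPECTED = {"tensorflow-gpu": "2.5.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to (expected, found, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in compare_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {found}, match={ok}")
```

The lookup parameter exists only so that the comparison can be exercised without touching the real environment.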

The version compatibility table I referred to is at
https://www.tensorflow.org/install/source

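Details of the successful installation

Install tensorflow-gpu 2.5.0

conda activate <env-name>
pip install tensorflow-gpu==2.5.0  # use pip when conda install tensorflow-gpu==2.5.0 cannot find the package

Check that the installation succeeded. If /device:GPU:0 appears in the output, it is safe to move on to the next step.

>>> import tensorflow as tf
>>> tf.__version__
'2.5.0'
>>> tf.test.gpu_device_name()

The output ends with
'/device:GPU:0'
and the "skip gpu..." message no longer appears.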


2021-11-21 09:11:11.576578: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-21 09:11:11.586861: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-11-21 09:11:12.301775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:3b:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-11-21 09:11:12.302507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:5e:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-11-21 09:11:12.303282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties: 
pciBusID: 0000:b1:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-11-21 09:11:12.303954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties: 
pciBusID: 0000:d9:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-11-21 09:11:12.304010: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-11-21 09:11:12.322885: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-11-21 09:11:12.322999: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-11-21 09:11:12.337252: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-11-21 09:11:12.342694: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-11-21 09:11:12.348923: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-11-21 09:11:12.354146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-11-21 09:11:12.356096: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-11-21 09:11:12.362607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
2021-11-21 09:11:12.363212: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-11-21 09:11:17.044150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-21 09:11:17.044213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 2 3 
2021-11-21 09:11:17.044240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N N N N 
2021-11-21 09:11:17.044245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   N N N N 
2021-11-21 09:11:17.044249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2:   N N N N 
2021-11-21 09:11:17.044254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   N N N N 
2021-11-21 09:11:17.050196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:3b:00.0, compute capability: 8.6)
2021-11-21 09:11:17.053391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/device:GPU:1 with 3665 MB memory) -> physical GPU (device: 1, name: GeForce RTX 3090, pci bus id: 0000:5e:00.0, compute capability: 8.6)
2021-11-21 09:11:17.054353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/device:GPU:2 with 3663 MB memory) -> physical GPU (device: 2, name: GeForce RTX 3090, pci bus id: 0000:b1:00.0, compute capability: 8.6)
2021-11-21 09:11:17.055315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/device:GPU:3 with 3771 MB memory) -> physical GPU (device: 3, name: GeForce RTX 3090, pci bus id: 0000:d9:00.0, compute capability: 8.6)
'/device:GPU:0'
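
The "Created TensorFlow device" lines in the log above carry the most useful information (device name, memory TensorFlow could claim, compute capability). As a rough sketch, they can be pulled out with a regular expression. The pattern is written against the exact wording of the TensorFlow 2.5.0 log shown here and may need adjusting for other versions; parse_devices is a hypothetical helper name.

```python
import re

# Matches the "Created TensorFlow device" lines as printed in the log above.
DEVICE_LINE = re.compile(
    r"Created TensorFlow device \((?P<tf_name>/device:GPU:\d+) "
    r"with (?P<mem_mb>\d+) MB memory\) -> physical GPU \("
    r"device: (?P<index>\d+), name: (?P<name>[^,]+), "
    r"pci bus id: (?P<pci>[0-9a-fA-F:.]+), "
    r"compute capability: (?P<cc>[\d.]+)\)"
)


def parse_devices(log_text):
    """Return one dict per GPU that TensorFlow reported creating."""
    devices = []
    for match in DEVICE_LINE.finditer(log_text):
        info = match.groupdict()
        info["mem_mb"] = int(info["mem_mb"])
        info["index"] = int(info["index"])
        devices.append(info)
    return devices


sample = (
    "2021-11-21 09:11:17.050196: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1418] Created TensorFlow device (/device:GPU:0 with "
    "3793 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, "
    "pci bus id: 0000:3b:00.0, compute capability: 8.6)"
)
for device in parse_devices(sample):
    print(device["tf_name"], device["name"], device["mem_mb"], "MB")
```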

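Install Keras

pip install keras

With the conda environment activated, install both TensorFlow and Keras with pip. Otherwise conda installs a second copy of TensorFlow and the two conflict.

In the end I changed every keras import in the code to tensorflow.keras, so the standalone keras package is no longer used and does not actually need to be installed.

Install cuDNN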

conda install -c nvidia cudnn=8.1.0

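Problem 1: Testing whether TensorFlow is installed correctly

Some blogs say the message below can simply be ignored, but in my tests the GPU could not be used, which means the installation was not correct.

import tensorflow as tf
tf.test.gpu_device_name()

Message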

I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

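To track this down, I first looked up what oneDNN is:
https://01.org/onednn
It turned out that the real cause was that TensorFlow had not been installed correctly; it had to be uninstalled and reinstalled. The oneDNN line itself is informational and also appears in the successful log above.

One article says the message can be ignored, but following it only hides the message; the GPU still cannot be used.
https://blog.csdn.net/qq_39096123/article/details/100575784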

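Problem 2: tensorflow vs. tensorflow-gpu

The official website mentions that the two packages were separate in earlier releases, so I assumed that installing tensorflow 2.5 directly would be enough. In practice, the tensorflow package I installed turned out to be a CPU build and could not use the GPU.

Reference: https://www.jianshu.com/p/e772b880b4d2

Check whether TensorFlow can access the GPU

tf.config.list_physical_devices('GPU')

This returned an empty list, meaning no GPU was found.

tf.test.is_built_with_cuda()

The result showed that the directly installed tensorflow was not built with CUDA, so it could not use the GPU.

tensorflow-gpu should be installed instead.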
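
The two checks above can be combined into one small diagnostic. This is a sketch rather than an official procedure: diagnose is a hypothetical helper and its messages are my own wording, while tf.test.is_built_with_cuda() and tf.config.list_physical_devices('GPU') are the same calls used in this section.

```python
def diagnose(built_with_cuda, gpu_count):
    """Turn the two check results into a short human-readable verdict."""
    if not built_with_cuda:
        return "not a CUDA build: install a GPU-enabled TensorFlow package"
    if gpu_count == 0:
        return "CUDA build, but no GPU visible: check driver, CUDA and cuDNN"
    return f"ok: {gpu_count} GPU(s) visible"


try:
    import tensorflow as tf
except ImportError:
    # Lets the script run (and report) even where TensorFlow is missing.
    print("TensorFlow is not installed in this environment")
else:
    gpus = tf.config.list_physical_devices("GPU")
    print(diagnose(tf.test.is_built_with_cuda(), len(gpus)))
```

Separating the two cases matters because they call for different fixes: a non-CUDA build needs a different package, while a CUDA build that sees no GPU usually points at the driver or the CUDA/cuDNN libraries.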

Problem 3: None of the conda channels has tensorflow-gpu=2.5.0, but pip does

Versions at this point

conda 4.9.2
pip 21.1.3 from /home/username/miniconda3/envs/envnames/lib/python3.8/site-packages/pip (python 3.8)

After installing with pip, the matching CUDA version was not installed automatically.

Pick the TensorFlow wheel that matches the Python version

pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-2.4.0-cp38-cp38-manylinux2010_x86_64.whl
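
The cp38 part of the wheel name is the Python tag, so the file only installs on CPython 3.8. A minimal sketch for checking a wheel name against an interpreter version follows; python_tag and wheel_matches_python are hypothetical helper names, and only the Python tag is checked (not the ABI or platform tags).

```python
import sys


def python_tag(version_info=sys.version_info):
    """CPython tag for an interpreter version, e.g. (3, 8) -> 'cp38'."""
    return f"cp{version_info[0]}{version_info[1]}"


def wheel_matches_python(wheel, version_info=sys.version_info):
    """Check whether a wheel file name (or URL) targets the given Python."""
    filename = wheel.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        return False
    # Wheel names end with <python tag>-<abi tag>-<platform tag>.whl
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        return False
    return python_tag(version_info) in parts[-3].split(".")


url = ("https://storage.googleapis.com/tensorflow/linux/gpu/"
       "tensorflow_gpu-2.4.0-cp38-cp38-manylinux2010_x86_64.whl")
print(python_tag(), wheel_matches_python(url))
```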

After installation, tensorflow-gpu 2.4.0 shows up in the package list.

The GPU still could not be used.

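Problem 4: TensorFlow is the GPU build; does Keras also need a GPU build?

keras-gpu
I installed keras-gpu through conda (the command itself was only shown in a screenshot).
After the installation, conda automatically updated TensorFlow.
In other words, installing keras-gpu alone is enough, and the matching tensorflow-gpu gets installed along with it.
However, in the Python console TensorFlow no longer worked, possibly because pip had installed one tensorflow and conda had installed another.

In addition, the installed keras-gpu could not be imported with import keras, which the current program requires, so I abandoned this installation method.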
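
One way to see whether pip and conda have both touched the TensorFlow and Keras packages is to look at the channel column of the conda list output, where pip-installed packages are reported with the channel pypi. The sketch below parses such output; the sample text is a hypothetical illustration, not output captured from my machine, and package_sources is a made-up helper name.

```python
def package_sources(conda_list_output, keywords=("tensorflow", "keras")):
    """Map matching package names to where they were installed from."""
    sources = {}
    for line in conda_list_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comments
        fields = line.split()
        name = fields[0]
        if not any(keyword in name for keyword in keywords):
            continue
        # The channel column is empty for the default conda channel.
        channel = fields[3] if len(fields) > 3 else "defaults"
        origin = "pip" if channel == "pypi" else f"conda ({channel})"
        sources.setdefault(name, []).append(origin)
    return sources


# Hypothetical sample in the layout printed by `conda list`.
SAMPLE = """\
# Name                    Version                   Build  Channel
keras-gpu                 2.4.3                         0
numpy                     1.19.5                   pypi_0    pypi
tensorflow-gpu            2.5.0                    pypi_0    pypi
"""
print(package_sources(SAMPLE))
```

A mix of pip and conda origins among these packages is a hint that the two installers may be stepping on each other.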

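Problem 5: TensorFlow 2.5 and Keras 2.4.3 may be incompatible

Running the code raised an error, and the error came from keras.

Relationship between keras and tf.keras

Fix: change every keras reference in the code to tensorflow.keras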
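
For a larger code base, the import change can be scripted. The following is only a sketch that handles the two most common forms (from keras... and a bare import keras); anything else, such as import keras.backend as K, is left untouched and needs a manual edit. rewrite_keras_imports is a hypothetical helper name.

```python
import re

# "from keras ..." and "from keras.xxx ..." at the start of a line.
FROM_KERAS = re.compile(r"^([ \t]*)from keras(\.|[ \t])", re.MULTILINE)
# A bare "import keras" on its own line.
IMPORT_KERAS = re.compile(r"^([ \t]*)import keras[ \t]*$", re.MULTILINE)


def rewrite_keras_imports(source):
    """Point standalone keras imports at tensorflow.keras instead."""
    source = FROM_KERAS.sub(r"\1from tensorflow.keras\2", source)
    source = IMPORT_KERAS.sub(r"\1from tensorflow import keras", source)
    return source


before = (
    "import keras\n"
    "\n"
    "from keras.layers import Dense\n"
    "from keras import backend as K\n"
    "import kerastuner\n"
)
print(rewrite_keras_imports(before))
```

Review the result before committing it, since a regular expression cannot tell real imports from text inside strings or comments.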

Problem 6: cuDNN error

Failed to get convolution algorithm. 
This is probably because cuDNN failed to initialize, 
so try looking to see if a warning log message was printed above.

Details

Loaded runtime CuDNN library: 8.0.5 
but source was compiled with: 8.1.0.  

CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
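
The rule stated in this message (same major version, equal or higher minor version) can be written down as a short check. This is a sketch of that rule only, with hypothetical helper names, not code taken from TensorFlow.

```python
def parse_version(text):
    """'8.1.0' -> (8, 1, 0)"""
    return tuple(int(part) for part in text.split("."))


def cudnn_compatible(loaded, compiled):
    """Apply the rule from the error message to two version strings."""
    loaded_v = parse_version(loaded)
    compiled_v = parse_version(compiled)
    same_major = loaded_v[0] == compiled_v[0]
    minor_ok = loaded_v[1] >= compiled_v[1]
    return same_major and minor_ok


# The failing combination from the message: runtime 8.0.5, compiled with 8.1.0.
print(cudnn_compatible("8.0.5", "8.1.0"))
# The combination that worked in this post.
print(cudnn_compatible("8.1.0", "8.1.0"))
```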

Installing cuDNN 8.1.0 resolves this. pip cannot provide it:

pip install cudnn
ERROR: Could not find a version that satisfies the requirement cudnn (from versions: none)
ERROR: No matching distribution found for cudnn

conda did find the package, but the default version did not meet the requirement.
In the end, the following command installed the correct cuDNN version:
https://anaconda.org/nvidia/cudnn

conda install -c nvidia cudnn

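Installing other CUDA versions

Configuring multiple CUDA and cuDNN versions on one server (different Linux accounts using different CUDA and cuDNN versions):
https://www.cnblogs.com/sddai/p/10278005.html

Download link
https://developer.nvidia.com/cuda-toolkit-archive
Download CUDA from the official site, extract it, and set the environment variables.
That is, append the following lines to the end of the .bashrc file in your home directory and then run source .bashrc.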

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/user/cuda/lib64
export PATH=$PATH:/home/user/cuda/bin
export CUDA_HOME=/home/user/cuda  # a single directory, not a colon-separated list
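
After sourcing .bashrc, the variables can be sanity-checked from Python. A minimal sketch, assuming the /home/user/cuda location used above and that CUDA_HOME is meant to hold exactly one directory; check_cuda_env is a hypothetical helper name.

```python
import os


def check_cuda_env(env, cuda_root):
    """Return a list of problems found in the CUDA-related variables."""
    problems = []
    if env.get("CUDA_HOME") != cuda_root:
        problems.append(f"CUDA_HOME should be exactly {cuda_root}")
    if f"{cuda_root}/bin" not in env.get("PATH", "").split(":"):
        problems.append(f"PATH is missing {cuda_root}/bin")
    if f"{cuda_root}/lib64" not in env.get("LD_LIBRARY_PATH", "").split(":"):
        problems.append(f"LD_LIBRARY_PATH is missing {cuda_root}/lib64")
    return problems


issues = check_cuda_env(os.environ, "/home/user/cuda")
for line in issues or ["CUDA variables look consistent"]:
    print(line)
```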

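Unresolved questions

What is the difference between a device interconnect matrix that is all N and one that contains some Y entries, and does it affect model training?
My earlier understanding was that Y means the two GPUs can communicate with each other directly. Here every entry is N, yet a single program can still use multiple GPUs successfully, so Y versus N has made no visible difference so far.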

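Other issues:

When installing tensorflow-gpu==2.4, the package file could not be found.
After locating it on the Anaconda files page (link below), I opened the file details and copied the source_url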

conda install <source_url>

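and tensorflow-gpu is installed almost instantly.

https://anaconda.org/anaconda/tensorflow-gpu/files

The installation appears to succeed, but import then fails.

Downloading the file shows that it contains only some basic metadata and no actual package content.
This shortcut does not work.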