1. First check whether the NVIDIA driver is installed
nvidia-smi
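If this fails with -bash: nvidia-smi: command not found,
go to the official NVIDIA driver download page,
find and download the version that matches your GPU, then run: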
sudo chmod +x NVIDIA-Linux-x86_64-470.63.01.run
sudo sh NVIDIA-Linux-x86_64-470.63.01.run
Error:
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
To disable Nouveau, see the referenced blog post; the usual approach is to blacklist the nouveau module, rebuild the initramfs, and reboot.
Error:
Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
See: CentOS 7.5 NVIDIA driver issues
References:
1. CentOS7: manually upgrade the kernel to the specified version
2. kernel-devel download
3. ERROR: Unable to find the kernel source tree
4. Installing and uninstalling kernel-devel
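rpm -qa | grep -E "kernel-devel|kernel-headers"
The kernel-devel and kernel-headers versions turned out to differ. Using references 1 and 2, download the kernel-devel package that matches the running kernel (check with uname -r) and copy it to the server:
sudo yum localinstall kernel-devel-3.10.0-957.27.2.el7.x86_64.rpm
Remove the extra kernel-devel package:
sudo yum remove kernel-devel-3.10.0-1160.41.1.el7.x86_64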
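The NVIDIA installer builds its module against the running kernel, so the check that matters is whether a kernel-devel package matches the output of uname -r. Here is a minimal Python sketch of that check. The helper names are made up for illustration, and it assumes rpm is on the PATH.

```python
import platform
import subprocess


def kernel_devel_matches(rpm_names, running_kernel):
    """Return True if a kernel-devel package for the running kernel is listed."""
    return ("kernel-devel-" + running_kernel) in rpm_names


def installed_kernel_devel():
    """List installed kernel-devel packages; empty list if rpm is unavailable."""
    try:
        out = subprocess.run(
            ["rpm", "-q", "kernel-devel"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            check=False,
        ).stdout
    except FileNotFoundError:
        return []
    # Keep only real package names; rpm prints a message if nothing is installed.
    return [ln.strip() for ln in out.splitlines() if ln.startswith("kernel-devel-")]


if __name__ == "__main__":
    running = platform.release()  # same value as `uname -r`
    packages = installed_kernel_devel()
    print("running kernel:", running)
    print("kernel-devel packages:", packages)
    print("match:", kernel_devel_matches(packages, running))
```

If the last line prints False, install the kernel-devel package for the running kernel before starting the driver installer again.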
Run the installer again:
sudo sh NVIDIA-Linux-x86_64-470.63.01.run
Warning:
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the pkg-config utility and the X.Org SDK/development package for your distribution and reinstall the driver.
This warning can be ignored; answer yes to the remaining prompts.
Finally, run nvidia-smi to verify.
Success!
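The same check can be scripted. The sketch below calls nvidia-smi with its --query-gpu option and parses the CSV output. The function names are illustrative, and the GPU name in the usage note is a made-up sample, not output from this machine.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, driver_version' CSV lines into (name, driver) tuples."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the GPU name is preserved.
        name, _, driver = line.rpartition(",")
        gpus.append((name.strip(), driver.strip()))
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU reported; the driver may not be installed or loaded")
    for name, driver in gpus:
        print(name, "driver", driver)
```

For example, parse_gpu_csv("Tesla V100-PCIE-16GB, 470.63.01") yields one tuple with the name and the driver version.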
2. Check the CUDA version:
cat /usr/local/cuda/version.txt
Or:
nvcc -V
Check the cuDNN version:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
Check whether the GPU is usable.
In Jupyter, run:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
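If the versions do not match, the output is:
Num GPUs Available: 0
Look up the CUDA and cuDNN versions that match your TensorFlow version on the official TensorFlow site:
https://tensorflow.google.cn/install/source
According to that table:
TensorFlow 2.0.0 requires CUDA 10.0 and cuDNN 7.x (the table lists 7.4; this guide installs 7.6).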
3. Install CUDA 10.0
Uninstall the previous CUDA version (optional):
cd /usr/local/cuda-XX/bin
sudo ./uninstall_cuda_toolkit_XX.pl
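Download the matching CUDA 10.0 package.
Download page: https://developer.nvidia.com/cuda-toolkit-archive
First check the Linux distribution version:
cat /etc/redhat-release
Output: CentOS Linux release 7.7
Then check the architecture:
uname -a
Output: x86_64
Download the matching version,
copy it to the server, and install it: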
sudo chmod +x cuda_10.0.130_410.48_linux.run
sudo sh cuda_10.0.130_410.48_linux.run
For the installer options, see:
https://www.freesion.com/article/6641492348/
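Error:
An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel…
This error comes from the driver installation step.
Solution: stop the display manager (lightdm in this case):
sudo service lightdm stop
Switch away from the graphical target:
sudo systemctl isolate multi-user.target
Unload the NVIDIA kernel module:
sudo modprobe -r nvidia-drm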
After installation, check:
cat /usr/local/cuda/version.txt
Output: CUDA Version 10.0.130
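To compare the installed toolkit with the version TensorFlow needs, the version file can be parsed in a few lines. This is a sketch that assumes the file has the "CUDA Version X.Y.Z" form shown above; the function names are illustrative.

```python
import re


def parse_cuda_version(text):
    """Extract (major, minor[, patch]) from a 'CUDA Version X.Y.Z' line, or None."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def satisfies(installed, required):
    """True if the installed version starts with the required (major, minor) prefix."""
    return installed is not None and installed[:len(required)] == tuple(required)


if __name__ == "__main__":
    try:
        with open("/usr/local/cuda/version.txt") as fh:
            installed = parse_cuda_version(fh.read())
    except OSError:
        installed = None
    print("installed CUDA:", installed)
    # (10, 0) is the requirement for TensorFlow 2.0.0 used in this guide.
    print("matches CUDA 10.0:", satisfies(installed, (10, 0)))
```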
Add CUDA to the environment variables:
vim ~/.bashrc
Append:
export PATH="/usr/local/cuda-10.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH"
Then reload the configuration:
source ~/.bashrc
Check:
nvcc -V
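Output (this sample reports release 10.1; with the CUDA 10.0 toolkit on PATH, expect release 10.0, V10.0.130):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
CUDA installation is complete.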
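If nvcc reports an unexpected release, a likely cause is that PATH or LD_LIBRARY_PATH points at another toolkit. This small sketch lists which of the expected directories are missing from the current environment. The helper name is made up, and the directories are the cuda-10.0 paths used in this guide.

```python
import os


def missing_dirs(path_value, required_dirs):
    """Return the required directories that do not appear in a ':'-separated path."""
    entries = [entry.rstrip("/") for entry in path_value.split(":") if entry]
    return [d for d in required_dirs if d.rstrip("/") not in entries]


if __name__ == "__main__":
    checks = {
        "PATH": ["/usr/local/cuda-10.0/bin"],
        "LD_LIBRARY_PATH": ["/usr/local/cuda-10.0/lib64"],
    }
    for variable, required in checks.items():
        missing = missing_dirs(os.environ.get(variable, ""), required)
        print(variable, "OK" if not missing else "missing: " + ", ".join(missing))
```

Run it in a new shell, or after source ~/.bashrc, so that it sees the updated variables.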
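4. Install cuDNN for CUDA 10.0
Download from the official site: https://developer.nvidia.com/rdp/cudnn-archive
Choose the version that matches your CUDA version.
I chose: cudnn-10.0-linux-x64-v7.6.5.32.tgz
Copy it to the server and extract it:
tar -xvf cudnn-10.0-linux-x64-v7.6.5.32.tgz
Extraction produces a cuda directory.
Copy the files into the CUDA installation: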
sudo cp cuda/include/cudnn.h /usr/local/cuda-10.0/include # use the path of your CUDA version
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64 # use the path of your CUDA version
sudo chmod a+r /usr/local/cuda-10.0/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*
Check:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
cuDNN installation is complete!
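The grep above prints three #define lines. The sketch below reads the same header and combines them into one version tuple. It assumes the cuDNN 7.x layout, where the version macros live in cudnn.h; the function name is illustrative.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the CUDNN_* defines, or None if any is missing."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + key + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    try:
        with open("/usr/local/cuda/include/cudnn.h") as fh:
            version = parse_cudnn_version(fh.read())
    except OSError:
        version = None
    print("cuDNN version:", version)
```

A result of None means the header was not found or does not define the three macros, so check that the copy step wrote to the CUDA directory that /usr/local/cuda points to.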
5. Check in Jupyter whether the GPU is available
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
6. Using the GPU with TensorFlow 2.0
import tensorflow as tf
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
If the printed device mapping lists both GPU and CPU devices, the GPU is in use.
If only CPU devices appear, the GPU was not enabled.
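The Session call above goes through the TF1 compatibility layer. An alternative is to run a small operation eagerly and inspect the device it ran on. This is only a sketch, assuming TensorFlow 2.x is installed; device_kind is a hypothetical helper, not a TensorFlow function.

```python
def device_kind(device_name):
    """Extract 'GPU' or 'CPU' from a name like '/job:localhost/.../device:GPU:0'."""
    marker = "device:"
    index = device_name.rfind(marker)
    if index == -1:
        return None
    return device_name[index + len(marker):].split(":")[0]


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        # Log the device chosen for each operation as it executes.
        tf.debugging.set_log_device_placement(True)
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.matmul(a, a)
        print("matmul ran on:", device_kind(b.device))
```

If the last line reports CPU even though nvidia-smi shows a GPU, recheck the CUDA and cuDNN versions against the TensorFlow build.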
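7. Check the Keras/TensorFlow version correspondence and the current version
Look up the correspondence in a Keras/TensorFlow compatibility table.
TensorFlow 2.0 corresponds to Keras 2.3.1.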
import keras
print(keras.__version__)
The reported version is 2.2.5.
Reinstall:
cd xxx/xxx/anaconda3/bin
./pip install keras==2.3.1
Installation complete!
References:
https://www.freesion.com/article/9245510937/
https://blog.csdn.net/sinat_23619409/article/details/84202651
https://blog.csdn.net/kingfoulin/article/details/98872965