Version compatibility between TensorFlow, the NVIDIA driver, CUDA, and cuDNN:
(https://www.tensorflow.org/install/source#tested_build_configurations)
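As a rough illustration of how the table at that link is used, the sketch below keeps a small hand-copied excerpt of it and looks up the CUDA/cuDNN pair for a given TensorFlow GPU release. The dictionary name, the function name, and the choice of rows are illustrative assumptions; always confirm the values against the page itself.

```python
# Hand-copied excerpt of the "tested build configurations" table
# (Linux GPU builds). Verify against the page before relying on it.
TESTED_CONFIGS = {
    "1.12.0": {"cuda": "9", "cudnn": "7"},
    "1.13.1": {"cuda": "10.0", "cudnn": "7.4"},
    "1.14.0": {"cuda": "10.0", "cudnn": "7.4"},
}


def required_versions(tf_version):
    """Return the CUDA/cuDNN pair listed for a TensorFlow release, or None."""
    return TESTED_CONFIGS.get(tf_version)


print(required_versions("1.14.0"))
```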
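I. Installing the driver
Ubuntu ships with nouveau, the open-source driver for NVIDIA graphics cards. Blacklist nouveau first, then install the official NVIDIA driver.
Check the attributes of the blacklist file: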
ls -lh /etc/modprobe.d/blacklist.conf
Check whether nouveau is blacklisted (no output means the module is not loaded):
lsmod | grep nouveau
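The same check can be sketched in Python by reading the first column of the `lsmod` output. This is slightly stricter than the grep, which also matches lines where nouveau only appears in the "Used by" column. The function name and the sample text are illustrative assumptions.

```python
def loaded_modules(lsmod_output):
    """Return the set of module names listed in `lsmod` output."""
    lines = lsmod_output.strip().splitlines()
    # The first line is the header: "Module  Size  Used by".
    return {line.split()[0] for line in lines[1:] if line.split()}


# Made-up sample of what lsmod prints while nouveau is still loaded.
SAMPLE = """Module                  Size  Used by
nouveau              1716224  1
video                  49152  1 nouveau
"""

print("nouveau" in loaded_modules(SAMPLE))
```

On a real machine the text could come from `subprocess.run(["lsmod"], capture_output=True, text=True).stdout`.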
To blacklist nouveau:
1) sudo gedit /etc/modprobe.d/blacklist.conf
2) Append these two lines at the end of the file:
blacklist nouveau
options nouveau modeset=0
3) Run: sudo update-initramfs -u
4) Reboot for the change to take effect: sudo reboot
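Before rebooting, it can help to confirm that both lines really made it into the file. The sketch below reports which of the two required lines are missing from a given file text; the function name and the constant are illustrative assumptions.

```python
REQUIRED = ("blacklist nouveau", "options nouveau modeset=0")


def missing_blacklist_lines(conf_text):
    """Return the required lines that are absent from the file text."""
    present = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]          # ignore comments
        present.add(" ".join(line.split()))  # normalise whitespace
    return [req for req in REQUIRED if req not in present]


# Example with inline text; on a real system, pass the contents of
# /etc/modprobe.d/blacklist.conf instead.
print(missing_blacklist_lines("blacklist nouveau\n"))
```

An empty list means both lines are present.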
Check the GPU model:
lspci | grep -i nvidia
Stop the X Window service:
sudo /etc/init.d/lightdm stop (or sudo service lightdm stop)
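Check the installed NVIDIA driver version:
dpkg --list | grep nvidia
Remove the existing NVIDIA driver (the quotes stop the shell from expanding the pattern against local file names):
sudo apt-get remove --purge 'nvidia*'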
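Install command (run as root; the second form is the same install without skipping the nouveau and X checks):
./NVIDIA-Linux-x86_64-390.77.run --no-opengl-files --no-nouveau-check --no-x-check
./NVIDIA-Linux-x86_64-390.77.run --no-opengl-files
(--no-opengl-files installs only the driver files, without the OpenGL files)
Start the X service again:
sudo /etc/init.d/lightdm start (or sudo service lightdm start)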
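II. Installing CUDA (patches are installed the same way)
What is CUDA, and how do the NVIDIA driver, CUDA, and cuDNN relate to each other?
1) CUDA (Compute Unified Device Architecture) is the parallel computing platform and programming model introduced by NVIDIA. The driver lets the operating system talk to the GPU, the CUDA toolkit runs on top of the driver, and cuDNN is a library of deep-learning primitives built on CUDA, so the three versions have to match.
Download page: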
step1:
cd /data/bigData/nvidia_driver_390.77 -- your own directory holding the installer files
chmod +x ./cuda_9.0.176_384.81_linux.run
sh ./cuda_9.0.176_384.81_linux.run
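step2: append the two export lines to ~/.bashrc, then reload it (use the directory that matches the CUDA version you actually installed):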
export PATH=/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
source ~/.bashrc
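To confirm that the two exports took effect in the current shell, a small check like the following can be used. It is a sketch: the helper name is hypothetical, and `cuda_home` is an assumption that must be changed to the directory of the CUDA version you installed.

```python
import os
import posixpath


def has_entry(path_value, directory):
    """True if `directory` is one of the colon-separated entries."""
    wanted = posixpath.normpath(directory)
    entries = [e for e in path_value.split(":") if e]
    return any(posixpath.normpath(e) == wanted for e in entries)


cuda_home = "/usr/local/cuda-10.0"  # assumption: adjust to your install
for var, sub in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib64")):
    target = posixpath.join(cuda_home, sub)
    ok = has_entry(os.environ.get(var, ""), target)
    print(var, "OK" if ok else "is missing " + target)
```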
step3: verify that CUDA was installed correctly
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery
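deviceQuery ends its report with a line of the form `Result = PASS` when it can reach the GPU. For scripted checks, that line can be picked out as in this sketch; the function name and the sample strings are illustrative assumptions.

```python
def device_query_passed(output):
    """True if the final "Result =" line of deviceQuery reports PASS."""
    for line in reversed(output.strip().splitlines()):
        if line.startswith("Result ="):
            return line.split("=", 1)[1].strip() == "PASS"
    return False


# Shortened, made-up sample of the tail of a deviceQuery run.
sample = "Detected 1 CUDA Capable device(s)\nResult = PASS\n"
print(device_query_passed(sample))
```

The real output could be captured with `subprocess.run(["./deviceQuery"], capture_output=True, text=True).stdout`.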
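Notes:
1. When downloading CUDA from the NVIDIA website, choose the installer type runfile (local).
2. During the CUDA installation you are asked whether to install the driver bundled with CUDA. Be sure to choose no!
III. Installing cuDNN (assuming CUDA is already installed under /usr/local/cuda/)
Download page:
Run: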
cp cudnn-9.0-linux-x64-v7.solitairetheme8 cudnn-9.0-linux-x64-v7.tgz
tar -xvf cudnn-9.0-linux-x64-v7.tgz // extracts into a cuda directory under the current directory
From the directory that contains the extracted cuda directory:
cp cuda/include/*.h /usr/local/cuda/include/
cp cuda/lib64/lib* /usr/local/cuda/lib64/
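(The next 3 steps can be skipped when reinstalling cuDNN; run them inside /usr/local/cuda/lib64/.) [Why do the .so files need symbolic links?]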
chmod +r libcudnn.so.7.0.5
ln -s libcudnn.so.7.0.5 libcudnn.so.7
ln -s libcudnn.so.7 libcudnn.so
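ldconfig -- refresh the dynamic linker cache so the new libraries take effect immediately
Note:
ldconfig is the command that manages dynamic link libraries: it rebuilds the system-wide cache that the loader uses to find shared libraries.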
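On the question of why the symbolic links are needed: the real file carries the full version number (libcudnn.so.7.0.5), programs load the library at run time under the shorter versioned name libcudnn.so.7, and the linker looks for plain libcudnn.so at build time. The two `ln -s` commands create that chain. The sketch below rebuilds the same chain in a temporary directory with an empty stand-in file and resolves it; the function name is hypothetical and nothing under /usr/local is touched.

```python
import os
import tempfile


def demo_symlink_chain():
    """Build libcudnn.so -> libcudnn.so.7 -> libcudnn.so.7.0.5 and resolve it."""
    with tempfile.TemporaryDirectory() as lib_dir:
        real = os.path.join(lib_dir, "libcudnn.so.7.0.5")
        open(real, "wb").close()  # empty stand-in for the real library

        # Relative links, like `ln -s` run inside the library directory.
        os.symlink("libcudnn.so.7.0.5", os.path.join(lib_dir, "libcudnn.so.7"))
        os.symlink("libcudnn.so.7", os.path.join(lib_dir, "libcudnn.so"))

        resolved = os.path.realpath(os.path.join(lib_dir, "libcudnn.so"))
        return os.path.basename(resolved)


print(demo_symlink_chain())
```

Both link names end up at the fully versioned file, so upgrading cuDNN only requires repointing the links.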
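Disable automatic updates on Ubuntu (a kernel upgrade can break a driver installed without DKMS). The APT::Periodic values in this file control the automatic updates; setting them to "0" turns them off: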
less /etc/apt/apt.conf.d/10periodic
Check the kernel version:
uname -sr
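Problems encountered
1) After installing the NVIDIA driver, nvidia-smi prints no information about the GPU
Solution: reinstall the NVIDIA driver; when the installer asks whether to restart X, choose "yes"
2) The prompt "Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later":
Choose NO!
3) cuDNN version mismatch, reported as "Loaded runtime CuDNN library: 7101 (compatibility version 7100)":
Solution: reinstall cuDNN using the matching v7.0 release from the official site (the author installed 7.04); this resolves the problem
4) Starting and stopping work on the GPU very frequently, for example by running nvidia-smi repeatedly, leads to the rpa-** problem
Solution: none found; the material consulted suggests it may be a hardware fault in the card itself.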
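For problem 3, it helps to read the numbers in the message as versions. In the cuDNN 7.x headers the version integer is built as major * 1000 + minor * 100 + patch level, so 7101 means cuDNN 7.1.1 was loaded at run time. The sketch below decodes such integers; the function name is hypothetical, and the formula is stated for the 7.x releases discussed here.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7.x version integer into (major, minor, patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


for code in (7101, 7100):
    print(code, "->", ".".join(str(part) for part in decode_cudnn_version(code)))
```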
Check whether TensorFlow can use the GPU correctly:
import os
import tensorflow as tf
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-08-29 09:51:46.603464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-29 09:51:46.603878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
2019-08-29 09:51:46.603923: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-29 09:51:46.603935: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-29 09:51:46.603944: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-29 09:51:46.603954: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-29 09:51:46.603963: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-29 09:51:46.603973: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-29 09:51:46.603983: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-29 09:51:46.604021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-29 09:51:46.604458: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-29 09:51:46.604842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-29 09:51:46.604861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-29 09:51:46.604866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-08-29 09:51:46.604873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-08-29 09:51:46.605019: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-29 09:51:46.605430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-29 09:51:46.605838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10468 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1
2019-08-29 09:51:46.605881: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.
Output like the above means the GPU can be used normally.
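The check above relies on `tf.Session`, which belongs to the TensorFlow 1.x API. As a shorter alternative, the sketch below just lists the GPUs TensorFlow can see. It assumes a TensorFlow build that provides `tf.config.list_physical_devices` (or the `tf.config.experimental` variant found in somewhat older releases); the function name is hypothetical, and it returns None when TensorFlow is not installed.

```python
def list_gpus():
    """Return the names of the GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Newer releases expose the call directly; older ones keep it
    # under tf.config.experimental.
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices
    return [device.name for device in lister("GPU")]


print(list_gpus())
```

An empty list means TensorFlow imported correctly but found no usable GPU, which usually points back to a driver, CUDA, or cuDNN version mismatch.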
-- over --