1.Install NVIDIA Graphics Driver via runfile
1.1 Uninstall the previous driver version:
zutnlp@YQ2:/opt/nvidia$ sudo apt-get purge 'nvidia*'
1.2 Download the NVIDIA driver
https://www.nvidia.com/Download/index.aspx?lang=en-us
Because another machine had already downloaded the file, it only needs to be copied over with scp:
zutnlp@YQ1:~/.ssh$ sudo scp zutnlp@10.63.3.31:~/Downloads/NVIDIA-Linux-x86_64-418.87.01.run zutnlp@10.63.3.32:/opt/nvidia
If scp fails with a permission error, change the permissions of the directory named in the error message to 777:
zutnlp@YQ2:/opt$ sudo chmod 777 nvidia/
[sudo] password for zutnlp:
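1.3 Disable the Nouveau driver
Because this had already been configured on this machine, almost nothing needed to be done. If the file does not look like the following, edit it with vim and add these lines.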
zutnlp@YQ2:~$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
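A minimal sketch of writing those two lines without an interactive editor. The helper name write_nouveau_blacklist and the scratch file are illustrative assumptions; on a real system the target is /etc/modprobe.d/blacklist-nouveau.conf and must be written as root. On Ubuntu the initramfs usually also has to be regenerated (for example with update-initramfs -u) and the machine rebooted before the blacklist takes effect.

```shell
#!/bin/sh
# Write the two Nouveau blacklist lines to the file given as $1.
# Hypothetical helper: point it at the real file only when running as root.
write_nouveau_blacklist() {
    printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$1"
}

# Demo against a scratch file so nothing under /etc is touched.
demo_conf="$(mktemp)"
write_nouveau_blacklist "$demo_conf"
cat "$demo_conf"
```

Running lsmod | grep nouveau after the reboot should print nothing if the module is no longer loaded.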
1.4 Stop the X server services
sudo service lightdm stop
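1.5 Run the runfile
At first the --dkms option failed with the error below for no obvious reason. Without --dkms, however, a later kernel update would break the driver and force a reinstall. The cause turned out to be missing dependencies: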
zutnlp@YQ2:/opt/nvidia$ sudo sh NVIDIA-Linux-x86_64-418.87.01.run -s --dkms --no-opengl-files
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.87.01..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
ERROR: Failed to find dkms on the system!
ERROR: Failed to install the kernel module through DKMS. No kernel module was
installed; please try installing again without DKMS, or check the DKMS
logs for more information.
ERROR: Installation has failed. Please see the file
'/var/log/nvidia-installer.log' for details. You may find suggestions
on fixing installation problems in the README available on the Linux
driver download page at www.nvidia.com.
The failure was caused by missing dependencies, so to be safe install the following packages:
sudo apt-get update
sudo apt-get install dkms build-essential linux-headers-generic
sudo apt-get install gcc-multilib xorg-dev
sudo apt-get install freeglut3-dev libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
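Before re-running the installer it can help to confirm that the build tools are actually on PATH. The following is a small sketch, not part of the original procedure: missing_tools is a hypothetical helper, and the list dkms, gcc, make is an assumption based on the error above (dkms) and on what build-essential provides (gcc, make).

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list; extend it as needed for your system.
missing="$(missing_tools dkms gcc make)"
if [ -n "$missing" ]; then
    echo "Missing before running the installer:" $missing
else
    echo "dkms, gcc and make are all available"
fi
```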
Re-running the installer then succeeded:
zutnlp@YQ2:/opt/nvidia$ sudo sh NVIDIA-Linux-x86_64-418.87.01.run -s --dkms --no-opengl-files
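Parameter notes:
--no-opengl-files: installs only the driver files and skips the OpenGL files. On this setup the option should not be omitted; otherwise the login screen can get stuck in a loop, commonly called a "login loop" or "stuck in login". The reason: the NVIDIA driver installs its own OpenGL libraries by default, while Ubuntu already ships OpenGL libraries that its GUI depends on. Once the NVIDIA installer overwrites them, the GUI can fail when it dynamically links against OpenGL.
--no-x-check: do not check for a running X server during installation. Optional here, because the graphical interface has already been stopped.
--no-nouveau-check: do not check for Nouveau during installation. Optional here, because Nouveau has already been disabled.
-Z, --disable-nouveau: disable Nouveau. Optional here, because Nouveau was disabled manually earlier.
-A: show more advanced options.
--dkms (recommended): registers the driver's kernel module with DKMS, so that a kernel update triggers a rebuild of the module for the new kernel and the driver does not have to be reinstalled by hand.
-s: silent installation, intended for batch installs. When installing on a single computer, leave this option off to get more installation information.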
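As a rough illustration of how these options combine, the sketch below composes the command line used in this section and prints it instead of executing it. build_driver_cmd is a hypothetical helper; only the options already shown above (--dkms, --no-opengl-files, -s) are used.

```shell
#!/bin/sh
# Compose the installer command line used in this section.
# $1 = path to the .run file, $2 = "yes" to add -s (silent mode).
build_driver_cmd() {
    cmd="sh $1 --dkms --no-opengl-files"
    if [ "$2" = "yes" ]; then
        cmd="$cmd -s"
    fi
    printf '%s\n' "$cmd"
}

# Dry run: print the interactive (non-silent) variant for a single machine.
build_driver_cmd NVIDIA-Linux-x86_64-418.87.01.run no
```

Printing the command first makes it easy to review the options before prefixing it with sudo.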
The following output confirms that the driver was installed successfully:
zutnlp@YQ2:/opt/nvidia$ nvidia-smi
Sun Nov 17 18:47:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:04:00.0 Off | 0 |
| N/A 34C P0 29W / 250W | 0MiB / 12198MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:82:00.0 Off | 0 |
| N/A 35C P0 24W / 250W | 0MiB / 12198MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
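For scripted checks, the driver version can be pulled out of that banner. This is a sketch that assumes the banner keeps the "Driver Version: X.Y.Z" wording shown above; driver_version_from_banner is a hypothetical helper that reads nvidia-smi output from stdin.

```shell
#!/bin/sh
# Extract the driver version from nvidia-smi banner text read on stdin.
driver_version_from_banner() {
    sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Demo with the banner line from the output above.
# On a machine with the driver installed, pipe `nvidia-smi` in instead.
echo '| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |' \
    | driver_version_from_banner
```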
2.Install CUDA
2.1 Steps
https://developer.nvidia.com/cuda-downloads
Installing CUDA is fairly simple; just run the runfile:
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sudo sh cuda_10.1.243_418.87.00_linux.run
Because the file is large, it is faster to copy it to this machine from YQ1:
zutnlp@YQ1:~/.ssh$ scp /usr/local/cuda_10.1.243_418.87.00_linux.run zutnlp@10.63.3.32:/opt/nvidia
The installer then proceeds as follows:
────────────────────────────────────────────────────────────────────────────┐
│ End User License Agreement │
│ -------------------------- │
│ │
│ │
│ Preface │
│ ------- │
│ │
│ The Software License Agreement in Chapter 1 and the Supplement │
│ in Chapter 2 contain license terms and conditions that govern │
│ the use of NVIDIA software. By accepting this agreement, you │
│ agree to comply with all the terms and conditions applicable │
│ to the product(s) included herein. │
│ │
│ │
│ NVIDIA Driver │
│ │
│ │
│ Description │
│ │
│ This package contains the operating system driver and │
│──────────────────────────────────────────────────────────────────────────────│
│ Do you accept the above EULA? (accept/decline/quit): │
│ accept │
Press the space bar to toggle a selection; an X in the brackets means the item is selected. Select only the CUDA Toolkit here:
│ CUDA Installer │
│ - [ ] Driver │
│ [ ] 418.87.00 │
│ + [X] CUDA Toolkit 10.1 │
│ [ ] CUDA Samples 10.1 │
│ [ ] CUDA Demo Suite 10.1 │
│ [ ] CUDA Documentation 10.1 │
│ Options │
│ Install
Select Yes:
──────────────────────────────────────────────────────────────────────────────┐
│ A symlink already exists at /usr/local/cuda. Update to this installation? │
│ Yes │
│ No
After a short wait, the following summary indicates that the installation has finished:
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-10.1/
Samples: Not Selected
Please make sure that
- PATH includes /usr/local/cuda-10.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 418.00 is required for CUDA 10.1 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
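The warning states that CUDA 10.1 needs a driver of at least 418.00. A quick sketch of that comparison follows; version_ge is a hypothetical helper, and it relies on sort -V (version sort), which is assumed to be available as in GNU coreutils.

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
# Uses `sort -V` to pick the lower of the two version strings.
version_ge() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# The driver installed in section 1 versus the minimum from the warning.
if version_ge 418.87.01 418.00; then
    echo "driver 418.87.01 satisfies the 418.00 minimum"
else
    echo "driver is too old for CUDA 10.1"
fi
```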
Run the following command to verify the installation:
zutnlp@YQ2:/opt/nvidia$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.243
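The same check can be scripted. The sketch below assumes version.txt keeps the single-line "CUDA Version X.Y.Z" format shown above; cuda_version_from_file and major_minor are hypothetical helpers, and the demo reads a scratch copy rather than /usr/local/cuda/version.txt.

```shell
#!/bin/sh
# Print the version number from a version.txt-style file given as $1.
cuda_version_from_file() {
    awk '/^CUDA Version/ { print $3; exit }' "$1"
}

# Reduce a full version such as 10.1.243 to its major.minor series.
major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Demo on a scratch file containing the line shown above.
demo_file="$(mktemp)"
echo 'CUDA Version 10.1.243' > "$demo_file"
full="$(cuda_version_from_file "$demo_file")"
echo "toolkit $full, series $(major_minor "$full")"
rm -f "$demo_file"
```

The major.minor series is the part that has to match when choosing cuDNN packages later.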
2.2 Configure the runtime libraries
sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig
Add /usr/local/cuda/bin to the system PATH in /etc/environment, using the following commands (editing the file requires sudo):
zutnlp@YQ2:/opt/nvidia$ vim /etc/environment
zutnlp@YQ2:/opt/nvidia$ sudo vim /etc/environment
zutnlp@YQ2:/opt/nvidia$ source /etc/environment
2.3 Fix nvcc -V still reporting the old version
CUDA has been upgraded, but nvcc still reports the previous CUDA 9.0:
zutnlp@YQ2:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
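A stale version like this usually means an older nvcc is found earlier on PATH than the new one. That is an assumption about this machine, not something the output proves. The sketch below shows how such a lookup resolves; first_on_path is a hypothetical helper that mimics what the shell does when it searches PATH.

```shell
#!/bin/sh
# Print the first executable named $1 found in the colon-separated list $2.
first_on_path() {
    name="$1"
    search="$2"
    old_ifs="$IFS"
    IFS=:
    for dir in $search; do
        if [ -x "$dir/$name" ]; then
            IFS="$old_ifs"
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}

# Show which nvcc the current PATH would pick, if any.
first_on_path nvcc "$PATH" || echo "nvcc not found on PATH"
```

If the printed path is not under /usr/local/cuda/bin, the PATH order is the thing to fix.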
So the following lines need to be appended to the end of /etc/profile:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
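The ${VAR:+:${VAR}} form in these lines expands to a colon plus the old value only when the variable is set and non-empty, so no stray colon is left behind when it is empty. A small sketch of that behavior, using a hypothetical prepend_path helper:

```shell
#!/bin/sh
# Print $1 placed in front of the colon-separated list $2.
# ${existing:+:${existing}} yields ":<old value>" only when $2 is non-empty.
prepend_path() {
    existing="$2"
    printf '%s\n' "$1${existing:+:${existing}}"
}

prepend_path /usr/local/cuda/lib64 ""          # prints /usr/local/cuda/lib64
prepend_path /usr/local/cuda/bin /usr/bin:/bin # prints /usr/local/cuda/bin:/usr/bin:/bin
```

Avoiding the stray colon matters because an empty entry in PATH or LD_LIBRARY_PATH is treated as the current directory.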
After the change, the file looks like this:
zutnlp@YQ2:~$ cat /etc/profile
# /etc/profile: system-wide .profile file for the Bourne shell (sh(1))
# and Bourne compatible shells (bash(1), ksh(1), ash(1), ...).
if [ "$PS1" ]; then
if [ "$BASH" ] && [ "$BASH" != "/bin/sh" ]; then
# The file bash.bashrc already sets the default PS1.
# PS1='\h:\w\$ '
if [ -f /etc/bash.bashrc ]; then
. /etc/bash.bashrc
fi
else
if [ "`id -u`" -eq 0 ]; then
PS1='# '
else
PS1='$ '
fi
fi
fi
if [ -d /etc/profile.d ]; then
for i in /etc/profile.d/*.sh; do
if [ -r $i ]; then
. $i
fi
done
unset i
fi
xrandr --newmode 1920x1080 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync
xrandr --addmode VGA-1 1920x1080
xrandr --output VGA-1 --mode 1920x1080
export PYTHONPATH=/home/yuqing/data/tf/models:/home/zutnlp/quanyou.chang/models/tf/slim
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
The steps were as follows:
zutnlp@YQ2:~$ vim /etc/profile
zutnlp@YQ2:~$ sudo vim /etc/profile
[sudo] password for zutnlp:
zutnlp@YQ2:~$ source /etc/profile
Can't open display
Can't open display
Can't open display
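The "Can't open display" messages most likely come from the xrandr lines in /etc/profile, which have no X display to talk to in this session; they do not affect the CUDA settings. The configuration can now be tested: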
zutnlp@YQ2:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
3.Install cuDNN
3.1 Download the version that matches your CUDA release from the official site; note that there are three packages:
https://developer.nvidia.com/rdp/cudnn-download
zutnlp@YQ1:~/wueryong/nvidia/cuDNNN$ ll
total 336524
drwxrwxr-x 2 zutnlp zutnlp 4096 11月 17 15:50 ./
drwxrwxr-x 3 zutnlp zutnlp 4096 11月 17 15:50 ../
-rw-r--r-- 1 zutnlp zutnlp 180962466 11月 16 21:19 libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb
-rw-r--r-- 1 zutnlp zutnlp 159185716 11月 16 21:18 libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb
-rw-r--r-- 1 zutnlp zutnlp 4428908 11月 16 21:03 libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb
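The CUDA release each package targets is encoded in its file name after "+cuda". A sketch of reading it back so it can be compared with the installed toolkit series (10.1 here) follows; deb_cuda_version is a hypothetical helper and assumes the naming pattern shown in the listing above.

```shell
#!/bin/sh
# Print the CUDA series (for example 10.1) encoded in a cuDNN .deb file name.
deb_cuda_version() {
    printf '%s\n' "$1" | sed -n 's/.*+cuda\([0-9][0-9]*\.[0-9][0-9]*\)_.*/\1/p'
}

# Compare every package from the listing against the installed toolkit series.
toolkit_series="10.1"
for deb in libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb \
           libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb \
           libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb; do
    if [ "$(deb_cuda_version "$deb")" = "$toolkit_series" ]; then
        echo "ok: $deb"
    else
        echo "mismatch: $deb"
    fi
done
```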
3.2 Run the install commands
zutnlp@YQ1:~/wueryong/nvidia/cuDNNN$ sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb
Selecting previously unselected package libcudnn7.
(Reading database ... 351236 files and directories currently installed.)
Preparing to unpack libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb ...
Unpacking libcudnn7 (7.6.5.32-1+cuda10.1) ...
Setting up libcudnn7 (7.6.5.32-1+cuda10.1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
zutnlp@YQ1:~/wueryong/nvidia/cuDNNN$ sudo dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb
Selecting previously unselected package libcudnn7-dev.
(Reading database ... 351242 files and directories currently installed.)
Preparing to unpack libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb ...
Unpacking libcudnn7-dev (7.6.5.32-1+cuda10.1) ...
Setting up libcudnn7-dev (7.6.5.32-1+cuda10.1) ...
update-alternatives: using /usr/include/x86_64-linux-gnu/cudnn_v7.h to provide /usr/include/cudnn.h (libcudnn) in auto mode
zutnlp@YQ1:~/wueryong/nvidia/cuDNNN$ sudo dpkg -i libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb
Selecting previously unselected package libcudnn7-doc.
(Reading database ... 351248 files and directories currently installed.)
Preparing to unpack libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb ...
Unpacking libcudnn7-doc (7.6.5.32-1+cuda10.1) ...
Setting up libcudnn7-doc (7.6.5.32-1+cuda10.1) ...
3.3 Compile and run the test:
zutnlp@YQ1:/usr/src/nvidia-418.87.01$ cp -r /usr/src/cudnn_samples_v7/ ~
zutnlp@YQ1:/usr/src/nvidia-418.87.01$ cd ~/cudnn_samples_v7/mnistCUDNN
zutnlp@YQ1:~/cudnn_samples_v7/mnistCUDNN$ make clean && make
rm -rf *o
rm -rf mnistCUDNN
Linking agains cublasLt = true
CUDA VERSION: 10010
TARGET ARCH: x86_64
HOST_ARCH: x86_64
TARGET OS: linux
SMS: 30 35 50 53 60 61 62 70 72 75
/usr/local/cuda/bin/nvcc -ccbin g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o fp16_dev.o -c fp16_dev.cu
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -o fp16_emu.o -c fp16_emu.cpp
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -o mnistCUDNN.o -c mnistCUDNN.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o mnistCUDNN fp16_dev.o fp16_emu.o mnistCUDNN.o -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcublasLt -LFreeImage/lib/linux/x86_64 -LFreeImage/lib/linux -lcudart -lcublas -lcudnn -lfreeimage -lstdc++ -lm
zutnlp@YQ1:~/cudnn_samples_v7/mnistCUDNN$ ./mnistCUDNN
cudnnGetVersion() : 7605 , CUDNN_VERSION from cudnn.h : 7605 (7.6.5)
Host compiler version : GCC 4.9.4
There are 2 CUDA capable devices on your machine :
device 0 : sms 56 Capabilities 6.0, SmClock 1328.5 Mhz, MemSize (Mb) 12198, MemClock 715.0 Mhz, Ecc=1, boardGroupID=0
device 1 : sms 56 Capabilities 6.0, SmClock 1328.5 Mhz, MemSize (Mb) 12198, MemClock 715.0 Mhz, Ecc=1, boardGroupID=1
Using device 0
Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.031264 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.068768 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.079424 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.095104 time requiring 203008 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.101568 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.024544 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.059488 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.068256 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.076192 time requiring 203008 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.090816 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
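A sketch for checking this output from a script. It assumes two things visible in the run above: the sample prints "Test passed!" once per precision mode (twice in total), and the integer version follows the cuDNN 7 encoding major*1000 + minor*100 + patch, as in 7605 for 7.6.5. Both helper names are hypothetical.

```shell
#!/bin/sh
# Count "Test passed!" lines in mnistCUDNN output read from stdin.
count_passes() {
    grep -c 'Test passed!'
}

# Decode an integer such as 7605 into major.minor.patch,
# assuming the encoding major*1000 + minor*100 + patch.
decode_cudnn_version() {
    printf '%s\n' "$(( $1 / 1000 )).$(( $1 % 1000 / 100 )).$(( $1 % 100 ))"
}

decode_cudnn_version 7605
# Stand-in for real output; on the test machine use: ./mnistCUDNN | count_passes
printf 'Test passed!\nTest passed!\n' | count_passes
```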
4.Official reference documentation
Official documentation: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/