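I. References
Environment Setup 01: How to Check Graphics Card Information and Install the NVIDIA Driver on Ubuntu 16.04
[Very detailed] [Ubuntu 22.04] Step-by-step NVIDIA driver installation: anyone with hands can do it, even the old lady next door
II. Environment
OS: Ubuntu 16.04
GPU: NVIDIA GeForce GTX 1650, 4 GB
III. Preparation
Download the graphics driver
Download the driver that matches your graphics card model (download link).
# Driver version the author downloaded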
NVIDIA-Linux-x86_64-470.57.02.run
Install dependencies
sudo apt install gcc
sudo apt install g++
sudo apt install cmake
sudo apt install pkg-config
sudo apt install linux-headers-$(uname -r)
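The dependency commands above can be collapsed into a single apt call. The sketch below only builds the package list; `required_packages` is a hypothetical helper name, and limiting the kernel headers to the running kernel (via `uname -r`) is an assumption about what is needed, not something the original post states.

```shell
# Hypothetical helper: print the build dependencies for the NVIDIA .run installer.
# Assumption: only the headers for the currently running kernel are required.
required_packages() {
    printf '%s\n' gcc g++ cmake pkg-config "linux-headers-$(uname -r)"
}

# Usage (run manually, requires root):
#   sudo apt install $(required_packages)
```

Keeping the list in one place makes it easy to review what will be installed before running apt.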
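IV. Key steps
1. Check the graphics card model and driver
# Check the graphics card model
ubuntu-drivers devices
If there is no output, continue with the steps below.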
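If `ubuntu-drivers` prints nothing, `lspci` is another way to confirm that the system sees the card at all. This is a minimal sketch, assuming `lspci` (from pciutils) is installed; `filter_nvidia_gpus` is a hypothetical helper that just filters text read from stdin.

```shell
# Hypothetical helper: keep only NVIDIA display controllers from `lspci` output on stdin.
filter_nvidia_gpus() {
    grep -iE 'VGA compatible controller|3D controller|Display controller' | grep -i 'nvidia'
}

# Usage on a real machine:
#   lspci | filter_nvidia_gpus
```

An empty result suggests the card is not visible on the PCI bus, which is a hardware or BIOS question rather than a driver one.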
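2. Disable nouveau
nouveau is the open-source driver that ships with the system. It usually has to be disabled before a new NVIDIA driver is installed.
Check whether nouveau is currently loaded:
lsmod | grep nouveau
If there is output, nouveau is still loaded. Edit the configuration file /etc/modprobe.d/blacklist.conf:
# Append at the end of the file
blacklist nouveau
options nouveau modeset=0
Regenerate the initramfs so the change takes effect:
sudo update-initramfs -u
Reboot the system:
sudo reboot
Check again; no output means nouveau is disabled:
lsmod | grep nouveau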
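The nouveau check and the blacklist edit can be scripted so that rerunning them does not duplicate entries. This is a sketch only: `nouveau_loaded` and `add_nouveau_blacklist` are hypothetical helper names, and the target file is passed in as an argument so the logic can be tried on a scratch file first.

```shell
# Hypothetical helper: succeed if `lsmod` output on stdin lists the nouveau module.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}

# Hypothetical helper: append the two blacklist lines to the file given as $1,
# skipping any line that is already present so reruns do not duplicate entries.
add_nouveau_blacklist() {
    conf=$1
    touch "$conf"
    for line in 'blacklist nouveau' 'options nouveau modeset=0'; do
        grep -qxF "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
    done
}

# Usage on a real machine (as root):
#   if lsmod | nouveau_loaded; then
#       add_nouveau_blacklist /etc/modprobe.d/blacklist.conf
#       update-initramfs -u && reboot
#   fi
```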
3. Install ubuntu-drivers
# Add the PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
# Refresh the package index
sudo apt update
# Install ubuntu-drivers-common
sudo apt-get install ubuntu-drivers-common
# List the driver versions available for installation
sudo ubuntu-drivers devices
# Install the recommended driver
sudo ubuntu-drivers autoinstall
# Or install a specific driver
sudo apt install [driver-name]
yoyo@yoyo:~/360Downloads$ sudo ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.2/0000:04:00.0 ==
modalias : pci:v000010DEd00002204sv000010DEsd00001454bc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-535 - distro non-free recommended
driver : nvidia-driver-535-server-open - distro non-free
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-535-open - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
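To pick the recommended package out of output like the listing above without reading it by eye, a small awk filter is enough. `recommended_driver` is a hypothetical helper; it assumes the line layout shown in the listing (package name in the third field, the word "recommended" at the end of the line).

```shell
# Hypothetical helper: print the package marked "recommended" in
# `ubuntu-drivers devices` output read from stdin.
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}

# Usage on a real machine:
#   sudo apt install "$(ubuntu-drivers devices | recommended_driver)"
```

For the listing above this selects nvidia-driver-535, the same package that `ubuntu-drivers autoinstall` would be expected to choose.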
(Optional) Install the NVIDIA CUDA toolkit package
sudo apt-get install nvidia-cuda-toolkit
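4. Install the graphics driver
# 1. Press Ctrl+Alt+F1 and log in to switch from the GUI to the text console tty1
# 2. Temporarily stop the graphical session
sudo service lightdm stop
# 3. Remove any previously installed NVIDIA driver
sudo apt-get remove --purge 'nvidia*'
sudo apt-get autoremove
# 4. Make the driver installer executable
sudo chmod a+x NVIDIA-Linux-x86_64-470.57.02.run
# 5. Run the installer
sudo ./NVIDIA-Linux-x86_64-470.57.02.run --no-x-check --no-nouveau-check --no-opengl-files
Option notes:
1) --no-x-check: skip the check for a running X server. Without it, the installer may abort if it detects X.
2) --no-nouveau-check: skip the check for the nouveau driver. Optional, since nouveau has already been disabled.
3) --no-opengl-files: install only the driver files, not the OpenGL files. The author advises against omitting this option, because otherwise the login screen can get stuck in a loop, usually called a "login loop" or "stuck in login".
# 6. Start the graphical session again; if the screen does not switch automatically, press Ctrl+Alt+F7 to return to the GUI
sudo service lightdm start
# 7. If the settings dialog opens, the driver was installed successfully
nvidia-settings
# 8. Check the driver
nvidia-smi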
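Prompts during installation:
- Install NVIDIA's 32-bit compatibility libraries?
  Choose No.
- The distribution-provided pre-install script failed! Are you sure you want to continue?
  Choose Yes to continue.
- Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.
  Choose No to continue.
- Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.
  Choose Yes to continue.
Note:
The installer first prints a long run of dots (...) and then opens its interactive interface:
If asked whether to accept the license, choose accept;
If told that xxx is incomplete and asked whether to abort the installation, choose to continue;
If told that an old driver exists and asked whether to remove it, choose Yes;
If told that some modules are missing and asked whether to download them, choose No;
If asked whether to compile the module, choose OK;
If asked for permission to modify Xorg.conf, choose Yes;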
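Once the installer has finished and `nvidia-smi` runs, it can be worth confirming that the loaded driver is the version that was just installed. The sketch below compares the version embedded in the .run file name with a version string passed in; `run_file_version` and `driver_matches` are hypothetical helper names, and the usage comment assumes the `--query-gpu=driver_version` query of `nvidia-smi`.

```shell
# Hypothetical helper: extract the driver version from a .run installer file name,
# e.g. NVIDIA-Linux-x86_64-470.57.02.run -> 470.57.02
run_file_version() {
    name=$(basename "$1" .run)
    printf '%s\n' "${name##*-}"
}

# Hypothetical helper: succeed if the installed version ($2) matches the .run file ($1).
driver_matches() {
    [ "$(run_file_version "$1")" = "$2" ]
}

# Usage on a real machine:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_matches NVIDIA-Linux-x86_64-470.57.02.run "$installed" && echo "driver OK"
```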
V. FAQ
Q: Missing symbolic links
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8 is not a symbolic link
Solution: create the symbolic links.
sudo ln -sf /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.1 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
sudo ln -sf /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.1 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
sudo ln -sf /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.1 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
sudo ln -sf /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.1 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
sudo ln -sf /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.1 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
sudo ln -sf /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.1 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
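The `ln -sf` commands above all follow one pattern, so they can be written as a loop that also covers any library the list leaves out (the ldconfig output names libcudnn.so.8 as well). This is a sketch under assumptions: `relink_cudnn` is a hypothetical helper, and the full version 8.0.1 is taken from the commands above rather than detected.

```shell
# Hypothetical helper: for every libcudnn*.so.<full version> in directory $1,
# (re)create the matching libcudnn*.so.<major> symlink next to it.
# $2 is the full cuDNN version, e.g. 8.0.1 as used in the commands above.
relink_cudnn() {
    libdir=$1
    version=$2
    major=${version%%.*}
    for lib in "$libdir"/libcudnn*.so."$version"; do
        [ -e "$lib" ] || continue            # the glob matched nothing
        link="${lib%.so.*}.so.$major"
        ln -sf "$(basename "$lib")" "$link"  # relative link inside the same directory
    done
}

# Usage on a real machine (as root):
#   relink_cudnn /usr/local/cuda-11.0/targets/x86_64-linux/lib 8.0.1 && ldconfig
```

Because `ln -sf` replaces an existing regular file with a link, this handles the "is not a symbolic link" case directly.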
Q: Errors when testing cuDNN
Unsupported gpu architecture compute_30 on a CUDA 5 capable gpu
NVIDIA CUDA Toolkit 10.2.89
# Enter the mnistCUDNN directory
cd /usr/src/cudnn_samples_v7/mnistCUDNN
# Build mnistCUDNN
sudo make
or
sudo make -j8
mnistCUDNN.cpp:538:53: error: ‘CUDNN_CONVOLUTION_FWD_PREFER_FASTEST’ was not declared in this scope
CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
mnistCUDNN.cpp:541:53: error: there are no arguments to ‘cudnnGetConvolutionForwardAlgorithm’ that depend on a template parameter, so a declaration of ‘cudnnGetConvolutionForwardAlgorithm’ must be available [-fpermissive]
) );
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
mnistCUDNN.cpp:541:53: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
) );
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
mnistCUDNN.cpp:541:53: error: there are no arguments to ‘cudnnGetConvolutionForwardAlgorithm’ that depend on a template parameter, so a declaration of ‘cudnnGetConvolutionForwardAlgorithm’ must be available [-fpermissive]
) );
^
error_util.h:67:65: note: in definition of macro ‘checkCUDNN’
_error << "CUDNN failure\nError: " << cudnnGetErrorString(status); \
^
mnistCUDNN.cpp: In instantiation of ‘void network_t<value_type>::convoluteForward(const Layer_t<value_type>&, int&, int&, int&, int&, value_type*, value_type**) [with value_type = float]’:
mnistCUDNN.cpp:739:25: required from ‘int network_t<value_type>::classify_example(const char*, const Layer_t<value_type>&, const Layer_t<value_type>&, const Layer_t<value_type>&, const Layer_t<value_type>&) [with value_type = float]’
mnistCUDNN.cpp:831:75: required from here
mnistCUDNN.cpp:533:60: error: ‘cudnnGetConvolutionForwardAlgorithm’ was not declared in this scope
checkCUDNN( cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
mnistCUDNN.cpp:533:60: error: ‘cudnnGetConvolutionForwardAlgorithm’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
checkCUDNN( cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
^
error_util.h:67:65: note: in definition of macro ‘checkCUDNN’
_error << "CUDNN failure\nError: " << cudnnGetErrorString(status); \
^
mnistCUDNN.cpp:533:60: note: ‘cudnnGetConvolutionForwardAlgorithm’ declared here, later in the translation unit
checkCUDNN( cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
mnistCUDNN.cpp: In instantiation of ‘void network_t<value_type>::convoluteForward(const Layer_t<value_type>&, int&, int&, int&, int&, value_type*, value_type**) [with value_type = __half]’:
mnistCUDNN.cpp:739:25: required from ‘int network_t<value_type>::classify_example(const char*, const Layer_t<value_type>&, const Layer_t<value_type>&, const Layer_t<value_type>&, const Layer_t<value_type>&) [with value_type = __half]’
mnistCUDNN.cpp:898:83: required from here
mnistCUDNN.cpp:533:60: error: ‘cudnnGetConvolutionForwardAlgorithm’ was not declared in this scope
checkCUDNN( cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
mnistCUDNN.cpp:533:60: error: ‘cudnnGetConvolutionForwardAlgorithm’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
checkCUDNN( cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
^
error_util.h:67:65: note: in definition of macro ‘checkCUDNN’
_error << "CUDNN failure\nError: " << cudnnGetErrorString(status); \
^
mnistCUDNN.cpp:533:60: note: ‘cudnnGetConvolutionForwardAlgorithm’ declared here, later in the translation unit
checkCUDNN( cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
^
error_util.h:66:9: note: in definition of macro ‘checkCUDNN’
if (status != CUDNN_STATUS_SUCCESS) { \
^
Makefile:235: recipe for target 'mnistCUDNN.o' failed
make: *** [mnistCUDNN.o] Error 1
Linking agains cublasLt = true
CUDA VERSION: 11000
TARGET ARCH: x86_64
HOST_ARCH: x86_64
TARGET OS: linux
SMS: 30 35 50 53 60 61 62 70 72 75
/usr/local/cuda/bin/nvcc -ccbin g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o fp16_dev.o -c fp16_dev.cu
nvcc fatal : Unsupported gpu architecture 'compute_30'
Makefile:238: recipe for target 'fp16_dev.o' failed
make: *** [fp16_dev.o] Error 1
Cause:
CUDA 11.x no longer supports compute capability 3.0 (compute_30).
Solution:
(1) Edit the Makefile.
sudo gedit /home/yoyo/MyDocuments/cudnn_samples_v7/mnistCUDNN/Makefile
$(foreach sm,$(SMS),$(eval GENCODE_FLAGS += -gencode arch=compute_$(sm),code=sm_$(sm)))
Change the line above to:
GENCODE_FLAGS += -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75
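The replacement line above is simply the original SM list with 30 dropped. As an illustration of how that flag string is derived, the sketch below generates it from a list of SM numbers and a minimum supported value. `gencode_flags` is a hypothetical helper, not part of the cuDNN samples, and it does not emit the trailing PTX entry (`code=compute_75`) that the line above includes.

```shell
# Hypothetical helper: print -gencode flags for every SM in the list that is
# at least the minimum given as the first argument.
gencode_flags() {
    min=$1
    shift
    flags=""
    for sm in "$@"; do
        [ "$sm" -ge "$min" ] || continue
        flags="$flags -gencode arch=compute_$sm,code=sm_$sm"
    done
    printf '%s\n' "${flags# }"
}

# Usage: rebuild the list from the failing log, keeping only SM 35 and newer
#   gencode_flags 35 30 35 50 53 60 61 62 70 72 75
```

For a GTX 1650 (compute capability 7.5), building only `sm_75` would likely be enough, at the cost of binaries that do not run on older cards.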
(2) Edit the CMakeLists.txt file.
sudo gedit /usr/local/cuda-11.0/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt
Makefile.config
#>> nvcc -m64 -gencode=compute_20,code=sm_20 ... -dc math-funcs.cu -o math-funcs64.o
add_custom_command(OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/math-funcs64.o"
COMMAND ${NVCC} -m64 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -dc "${CMAKE_CURRENT_SOURCE_DIR}/math-funcs.cu" -o "${CMAKE_CURRENT_BINARY_DIR}/math-funcs64.o"
DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/math-funcs.cu"
COMMENT "Building math-funcs64.o")
Change it to:
#>> nvcc -m64 -gencode=compute_20,code=sm_20 ... -dc math-funcs.cu -o math-funcs64.o
add_custom_command(OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/math-funcs64.o"
COMMAND ${NVCC} -m64 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -dc "${CMAKE_CURRENT_SOURCE_DIR}/math-funcs.cu" -o "${CMAKE_CURRENT_BINARY_DIR}/math-funcs64.o"
DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/math-funcs.cu"
COMMENT "Building math-funcs64.o")
That is, remove:
-gencode=arch=compute_20,code=sm_20
-gencode=arch=compute_30,code=sm_30
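The same removal can be done with `sed` instead of editing by hand. This is a sketch only: `strip_old_gencode` is a hypothetical helper that filters text on stdin, and it assumes the flags are written exactly as in the COMMAND line above (`-gencode arch=compute_NN,code=sm_NN`, with either a space or `=` after `-gencode`).

```shell
# Hypothetical helper: drop the compute_20 and compute_30 -gencode flags
# from text read on stdin and print the result.
strip_old_gencode() {
    sed -E 's/ ?-gencode[ =]arch=compute_(20|30),code=sm_(20|30)//g'
}

# Usage: preview the result before overwriting the real file
#   strip_old_gencode < CMakeLists.txt | diff CMakeLists.txt -
```

Previewing with `diff` first keeps the original file intact until the change looks right.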
Q: Errors when testing cuDNN
Makefile:163: ../samples_common.mk: No such file or directory
CUDA VERSION:
TARGET ARCH: x86_64
HOST_ARCH: x86_64
TARGET OS: linux
SMS:
>>> WARNING - no SM architectures have been specified - waiving sample <<<
make: *** No rule to make target '../samples_common.mk'.  Stop.
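The post records no fix for this error. The message itself says the Makefile includes `../samples_common.mk`, so one thing worth checking (an assumption, not the author's diagnosis) is whether the sample directory was copied without the file that sits in its parent directory. `check_samples_common` is a hypothetical helper for that check.

```shell
# Hypothetical helper: report whether the file that the Makefile includes
# (../samples_common.mk, relative to the sample directory $1) is present.
check_samples_common() {
    if [ -f "$1/../samples_common.mk" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Usage: pass the directory you run make in
#   check_samples_common "$PWD"
```

If it reports "missing", copying the whole samples directory rather than only mnistCUDNN may help.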
Q: If you want to use the nvidia-installer please uninstall the Debian packages first.
Cause: the previously installed driver packages were not removed completely.
Solution: remove the driver packages again.
sudo apt-get remove --purge 'nvidia*'
sudo apt-get autoremove
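To see whether anything is actually left over after the purge, the installed package list can be filtered for NVIDIA packages. A minimal sketch: `installed_nvidia_packages` is a hypothetical helper that reads `dpkg -l` output from stdin and keeps only fully installed entries (status `ii`).

```shell
# Hypothetical helper: print installed packages whose name contains "nvidia",
# reading `dpkg -l` output from stdin.
installed_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }'
}

# Usage on a real machine:
#   dpkg -l | installed_nvidia_packages
```

An empty result suggests the Debian packages are gone and the .run installer should no longer complain about them.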
Q: ubuntu-drivers: command not found
Reference: Fixing "ubuntu-drivers: command not found"
yoyo@yoyo:/usr/lib/python3/dist-packages$ sudo ubuntu-drivers devices
sudo: ubuntu-drivers: command not found
Solution:
sudo apt-get install ubuntu-drivers-common
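A script that depends on `ubuntu-drivers` can test for it first and install the package only when needed. The sketch below uses the standard `command -v` lookup; `have_command` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the command named in $1 is on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Usage on a real machine:
#   have_command ubuntu-drivers || sudo apt-get install ubuntu-drivers-common
```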
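Q: E: Package 'ubuntu-drivers-common' has no installation candidate
yoyo@yoyo:/usr/lib/python3/dist-packages$ sudo apt-get install ubuntu-drivers-common
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package ubuntu-drivers-common is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'ubuntu-drivers-common' has no installation candidate
Solution: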
sudo add-apt-repository ppa:ubuntu-drivers/ppa
sudo apt update
sudo apt install ubuntu-drivers-common