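Note: installing the GPU driver, CUDA, and cuDNN on a server follows the same procedure as on a personal computer.
The overall steps are:
- install the GPU driver;
- install CUDA;
- install cuDNN.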
1. Installing the GPU driver
Download the driver from the official site: https://www.nvidia.cn/Download/index.aspx?lang=cn
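# Blacklist the open-source nouveau driver
sudo dpkg --add-architecture i386
sudo apt install build-essential libc6:i386
sudo gedit /etc/modprobe.d/blacklist.conf # on a server without a desktop, use vim or nano instead
Append the following two lines to the end of the file to blacklist nouveau:
blacklist nouveau
options nouveau modeset=0
Then rebuild the initramfs and check whether nouveau is still loaded. If the lsmod command prints anything, reboot the system (sudo reboot):
sudo update-initramfs -u
lsmod | grep nouveau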
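Editing the file by hand works, but re-running the procedure can leave duplicate entries. A minimal sketch of an idempotent append follows; blacklist_nouveau is a hypothetical helper name, not something shipped with the driver, and it has to run with write access to the target file (for example from a root shell).

```shell
# Append the two nouveau blacklist lines to a modprobe config file,
# skipping any line that is already present (safe to run repeatedly).
# blacklist_nouveau is a hypothetical helper, defined only for this sketch.
blacklist_nouveau() {
    conf="$1"
    for line in "blacklist nouveau" "options nouveau modeset=0"; do
        # -x: match the whole line, -F: treat the pattern as a fixed string
        grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}

# Example (as root): blacklist_nouveau /etc/modprobe.d/blacklist.conf
```

After calling it, sudo update-initramfs -u and the lsmod check are still needed as described above.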
Personal computer installation
# On a server, skip the next two steps and go straight to removing old drivers
Ctrl+Alt+F1 (switch to a text console)
sudo service lightdm stop
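# Remove previously installed drivers (patterns are quoted so the shell does not expand them against local files)
sudo apt-get remove "nvidia-*" # alternative: sudo dpkg -P $(dpkg -l | grep nvidia | awk '{print $2}')
sudo apt-get --purge remove "*nvidia*"
sudo apt-get --purge remove "*cublas*" "cuda*"
sudo apt-get --purge remove "nvidia*"
sudo apt-get autoremove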
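Because the purge commands above match broadly, it can help to preview what is installed before removing anything. The sketch below filters dpkg -l output; list_gpu_packages is a hypothetical helper name, and the nvidia|cuda|cublas pattern simply mirrors the patterns used in the commands above.

```shell
# Read `dpkg -l` output on stdin and print the names of installed ("ii")
# packages whose name contains nvidia, cuda, or cublas.
# list_gpu_packages is a hypothetical helper, defined only for this sketch.
list_gpu_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia|cuda|cublas/ { print $2 }'
}

# Example: dpkg -l | list_gpu_packages
```

If the list contains something unexpected, remove packages by explicit name instead of by wildcard.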
Installation:
Change to the directory that holds the downloaded installer
sudo chmod a+x NVIDIA-Linux-x86_64-455.23.04.run
sudo ./NVIDIA-Linux-x86_64-455.23.04.run --no-opengl-files --no-x-check --no-nouveau-check
Answers to the prompts during installation
1. The distribution-provided pre-install script failed! Are you sure you want to continue? ----> CONTINUE INSTALLATION
2. Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. ----> No
3. Install NVIDIA's 32-bit compatibility libraries? ----> No
4. Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. ----> YES
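On a personal computer, run the following (on a server, just wait for the installer to finish)
sudo service lightdm start # restart the display manager and return to the graphical desktop
nvidia-smi # check that the driver installed successfully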
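For a scripted check, the driver version can be read directly from nvidia-smi rather than from its full table. This is a small sketch; driver_version is a hypothetical helper name, while --query-gpu=driver_version and --format=csv,noheader are standard nvidia-smi query options.

```shell
# Print the driver version reported by nvidia-smi (one line per GPU).
# Fails with a message if nvidia-smi is not on PATH, which usually means
# the driver installation did not complete.
# driver_version is a hypothetical helper, defined only for this sketch.
driver_version() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found" >&2
        return 1
    fi
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# Example: driver_version
```

With the installer used above, the expected value would be 455.23.04.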
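2. Installing CUDA
If an older CUDA version is present, uninstall it first.
# xx.x stands for the installed version
cd /usr/local/cuda-xx.x/bin/
sudo ./uninstall_cuda_xx.x.pl
sudo rm -rf /usr/local/cuda-xx.x
If the steps above fail, use the following command instead
# Purge all CUDA-related packages (this matches every dpkg entry containing nvidia, so apt-installed driver packages are removed too)
sudo dpkg -P $(dpkg -l | grep nvidia | awk '{print $2}')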
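To fill in the xx.x placeholder, it helps to see which versioned CUDA directories actually exist. A minimal sketch, assuming the toolkits live under /usr/local as in the commands above; list_cuda_installs is a hypothetical helper name.

```shell
# Print the version suffix of every cuda-* directory under a prefix
# (default /usr/local), e.g. "11.1" for /usr/local/cuda-11.1.
# list_cuda_installs is a hypothetical helper, defined only for this sketch.
list_cuda_installs() {
    prefix="${1:-/usr/local}"
    for d in "$prefix"/cuda-*; do
        if [ -d "$d" ]; then
            printf '%s\n' "${d##*/cuda-}"
        fi
    done
}

# Example: list_cuda_installs
```

Symlinks to a directory are listed as well, so an entry does not always mean a separate install.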
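To install, pick the deb installer recommended on the official site: https://developer.nvidia.com/cuda-toolkit-archive
The commands below are what the site generated for one particular selection (save them as a .sh script and run it). They are kept here as a historical record; for your own install, copy the current commands from the site.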
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
Add CUDA to the environment variables
vim ~/.bashrc # add the two lines below; make sure the path matches the CUDA version actually installed
export PATH="/usr/local/cuda-11.8/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"
source ~/.bashrc
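Sourcing ~/.bashrc repeatedly keeps prepending the same directories, and an unset LD_LIBRARY_PATH leaves a trailing colon, which the dynamic linker treats as the current directory. The sketch below avoids both; path_prepend and CUDA_HOME are names chosen for this example, and /usr/local/cuda-11.8 is an assumed install prefix that should be adjusted to the version actually installed.

```shell
# Print a colon-separated list with "dir" prepended, unless it is already
# in the list. An empty list yields just "dir" (no trailing colon).
# path_prepend is a hypothetical helper, defined only for this sketch.
path_prepend() {
    dir="$1"
    list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *)          printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}

CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-11.8}"   # assumed install prefix
PATH="$(path_prepend "$CUDA_HOME/bin" "$PATH")"
LD_LIBRARY_PATH="$(path_prepend "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")"
export PATH LD_LIBRARY_PATH
```

These lines can go into ~/.bashrc in place of the two plain export lines.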
Verification
nvidia-smi # shows the driver version and the highest CUDA version that driver supports
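nvidia-smi reports on the driver, not on the toolkit that was just installed, so a second check with nvcc -V is useful. The sketch below extracts the release number from that output; cuda_release is a hypothetical helper name, and the parsing assumes the usual "release X.Y," wording printed by nvcc.

```shell
# Read `nvcc -V` output on stdin and print the toolkit release, e.g. "11.8".
# Relies on the line of the form "Cuda compilation tools, release X.Y, V...".
# cuda_release is a hypothetical helper, defined only for this sketch.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Example: nvcc -V | cuda_release
```

If nvcc is not found at all, the PATH entry added in ~/.bashrc most likely points at the wrong CUDA directory.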
3. Installing cuDNN
Download the version that matches your CUDA install from https://developer.nvidia.com/rdp/cudnn-archive, then run the example commands below in order.
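3.1 Installing from the archive
tar -xvf cudnn-***.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
# For newer releases (CUDA 11.8), copy all headers instead
sudo cp include/* /usr/local/cuda-11.8/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
nvcc -V # confirms the CUDA toolkit version only; it does not report cuDNN
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn_version.h # cuDNN 8 and later: prints the installed cuDNN version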
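To confirm which cuDNN headers were actually copied, the version macros can be read from the header file. A minimal sketch, assuming cuDNN 8 or later, where CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL are defined in cudnn_version.h (older releases define them in cudnn.h); cudnn_version is a hypothetical helper name.

```shell
# Print "major.minor.patch" from a cuDNN version header.
# Returns non-zero if the header has no CUDNN_MAJOR definition.
# cudnn_version is a hypothetical helper, defined only for this sketch.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { if (major == "") exit 1; print major "." minor "." patch }' "$1"
}

# Example: cudnn_version /usr/local/cuda/include/cudnn_version.h
```

The path in the example assumes /usr/local/cuda points at the toolkit the headers were copied into.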
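3.2 Installing from the deb package
After downloading the matching deb from the official site:
sudo dpkg -i cudnn-local-repo-ubuntu1804-8.9.6.50_1.0-1_amd64.deb # prints the key-copy command shown on the next line
sudo cp /var/cudnn-local-repo-ubuntu1804-8.9.6.50/cudnn-local-A48BB858-keyring.gpg /usr/share/keyrings/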
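Troubleshooting
- If nvidia-smi stops working after cuDNN is installed, reinstall the GPU driver.
- libmkl_gf_lp64.so (other .so files are handled the same way, as long as the trailing so.* version matches)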
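Problem encountered:
1. After installing PyTorch, importing it fails with ImportError: libmkl_gnu_thread.so: cannot open shared object file: No such file or directory
First locate libmkl_gf_lp64.so on the system and note the directory it lives in; this path is essential.
Then cd /etc/ld.so.conf.d and create a file with sudo vi runtime-x86_64.conf, writing that directory path into it.
Run sudo ldconfig to refresh the linker cache.
Start ipython and run import torch; it should now succeed.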
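The locate-and-register steps above can be sketched as a small helper. lib_dir_of is a hypothetical name, the -quit option assumes GNU find (the default on Ubuntu), and searching "$HOME" in the example is only an assumption about where MKL was installed.

```shell
# Print the directory containing the first file named $1 found under $2.
# Returns non-zero if no such file exists.
# lib_dir_of is a hypothetical helper, defined only for this sketch.
lib_dir_of() {
    found="$(find "$2" -name "$1" -print -quit 2>/dev/null || true)"
    if [ -z "$found" ]; then
        return 1
    fi
    dirname "$found"
}

# Example (the search root "$HOME" is an assumption):
#   dir="$(lib_dir_of libmkl_gf_lp64.so "$HOME")" &&
#       echo "$dir" | sudo tee /etc/ld.so.conf.d/runtime-x86_64.conf &&
#       sudo ldconfig
```

The file written to /etc/ld.so.conf.d must contain the directory, not the path of the .so file itself.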
- /sbin/ldconfig.real: /usr/local/cuda-11.8/lib64/libcudnn_cnn_infer.so.8 is not a symbolic link
ldconfig expects libcudnn_cnn_infer.so.8 to be a symlink to the fully versioned library file, so find that file and recreate the link.
locate libcudnn_cnn_infer.so.8 # find where the versioned library actually lives
sudo ln -sf /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.1 /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
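When several cuDNN libraries trigger the same warning, the relinking can be sketched as a helper that picks the highest versioned file in a directory and points the short name at it. relink_soname is a hypothetical name, sort -V assumes GNU coreutils, and the function needs write access to the directory (for /usr/local, run it from a root shell).

```shell
# In directory $1, replace $2 (e.g. libfoo.so.8) with a symlink to the
# highest versioned sibling (e.g. libfoo.so.8.9.6).
# relink_soname is a hypothetical helper, defined only for this sketch.
relink_soname() {
    dir="$1"
    soname="$2"
    # Collect the file names matching "<soname>.*" and keep the newest version.
    target="$(
        for f in "$dir/$soname".*; do
            if [ -e "$f" ]; then printf '%s\n' "${f##*/}"; fi
        done | sort -V | tail -n 1
    )"
    if [ -z "$target" ]; then
        echo "no versioned file for $soname in $dir" >&2
        return 1
    fi
    # A relative target keeps the link valid if the directory is moved.
    ln -sf "$target" "$dir/$soname"
}

# Example (as root):
#   relink_soname /usr/local/cuda-11.8/targets/x86_64-linux/lib libcudnn_cnn_infer.so.8
```

Run sudo ldconfig afterwards to confirm the warning is gone.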
- libcudart.so.10.2: cannot open shared object file
sudo cp -i libcudart.so.10.2 /usr/local/cuda-11.8/lib64 # copy the file into the lib64 directory
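Errors of the "cannot open shared object file" kind can be listed up front with ldd, which marks unresolved libraries as "not found". A minimal sketch; missing_libs is a hypothetical helper name, and the path in the example is a placeholder.

```shell
# Read `ldd` output on stdin and print the names of libraries that the
# dynamic linker could not resolve.
# missing_libs is a hypothetical helper, defined only for this sketch.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example (placeholder path): ldd /path/to/some_extension.so | missing_libs
```

A name such as libcudart.so.10.2 in the output indicates the program was built against the CUDA 10.2 runtime, so the file has to come from a 10.2 installation.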