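Installing CUDA 11.1 on Ubuntu 16.04 while keeping multiple CUDA versions switchable

This post is a log, written so that I don't lose details if something goes wrong later; treat it as a reference only. I hit many problems along the way, so do not copy the steps line by line: some of the intermediate fixes did not work. If anyone can solve the problems left open in the middle, I would be very grateful.

Goal: CUDA 10.1 is already installed; install CUDA 11.1 and be able to switch between the two.

1. Check the system and driver information: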

cat /proc/driver/nvidia/version
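
The first line of that file carries the driver version, which is the number that matters later, since CUDA 11.1 ends up needing driver 455.23 or newer. As a minimal sketch, the version can be pulled out and compared as below. The helper names driver_version_from and version_ge are made up for this post, the parsing assumes the usual "NVRM version: ... Kernel Module <version>" line, and the comparison relies on GNU sort -V.

```shell
# Hypothetical helpers, not part of any NVIDIA tool.

# Print the first dotted version number found in the given file
# (default: /proc/driver/nvidia/version).
driver_version_from() (
  file="${1:-/proc/driver/nvidia/version}"
  grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' "$file" | head -n 1
)

# Succeed when version $1 >= version $2 (relies on GNU sort -V).
version_ge() (
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
)

# Only run the check when the NVIDIA kernel module is loaded.
if [ -r /proc/driver/nvidia/version ]; then
  current="$(driver_version_from)"
  if version_ge "$current" 455.23; then
    echo "driver $current should be new enough for CUDA 11.1"
  else
    echo "driver $current is older than 455.23 and needs updating"
  fi
fi
```

On a machine without the NVIDIA module loaded, the block prints nothing.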

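2. Download CUDA from the official site and install it:

Official link: CUDA Toolkit Archive | NVIDIA Developer

After choosing the CUDA version and your system details, copy the generated commands to download and run the installer: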

wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run

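After choosing continue, accept, and then install with every component selected, the installer failed with: Installation failed. See log at /var/log/cuda-installer.log for details. The log showed that the driver installation had failed. Searching suggested that, because a driver was already installed, the driver component should not have been selected here. However, the existing driver is too old for CUDA 11.1 and has to be updated.

3. Reinstall the driver

(1) Remove the old driver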
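
The installer message points at /var/log/cuda-installer.log, and that is where the failing component shows up. Below is a minimal sketch for pulling the relevant lines out of it. show_install_failures is a hypothetical helper, and the pattern simply looks for the words "error" and "fail" rather than relying on any documented log format.

```shell
# Hypothetical helper: print every line of the installer log that mentions
# an error or a failure, prefixed with its line number.
show_install_failures() (
  log="${1:-/var/log/cuda-installer.log}"
  [ -r "$log" ] || { echo "cannot read $log" >&2; exit 1; }
  # -n keeps line numbers so the surrounding context is easy to find,
  # -i ignores case, -E enables the alternation.
  grep -niE 'error|fail' "$log"
)
```

Calling show_install_failures with no argument reads the default log path; the function returns a non-zero status when nothing matches.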

sudo apt-get remove 'nvidia-*'
sudo apt-get autoremove

Error: E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution)

Presumably some packages were broken or missing.

(2) Install the missing packages

Following the hint, run:

sudo apt-get -f install

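Next error: E: Sub-process /usr/bin/dpkg returned an error code (1)

Searching turned up one suggested fix:

Back up the contents of dpkg/info and then delete them. This did not help. In fact, the command below is flawed and deleted nothing; the correct way to empty the info folder appears later in this post.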

sudo cp -rp /var/lib/dpkg/info/ /var/lib/dpkg/info.bak/
sudo rm -rf /var/lib/dpkg/info/*xxx*
sudo apt-get -f install

Until now I had only looked at the last line of the error output, but there were earlier error messages as well:

dpkg-deb: error: subprocess paste was killed by signal (Broken pipe)
Errors were encountered while processing:
 /var/cache/apt/archives/nvidia-cuda-dev_7.5.18-0ubuntu1_amd64.deb

Searching gave this suggestion:

sudo dpkg -i --force-overwrite /var/cache/apt/archives/nvidia-cuda-dev_7.5.18-0ubuntu1_amd64.deb

This only produced a new error:

nvidia-cuda-dev depends on libcudart7.5 (= 7.5.18-0ubuntu1); however:
  Package libcudart7.5:amd64 is not configured yet.

dpkg: error processing package nvidia-cuda-dev (--install):
 dependency problems - leaving unconfigured
Processing triggers for man-db (2.7.5-1) ...
Errors were encountered while processing:

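Further searching led back to emptying dpkg/info. I was puzzled, since I thought I had already cleaned it out, but when I looked inside the folder, everything was still there. So I followed the new suggestion: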

sudo mv /var/lib/dpkg/info/ /var/lib/dpkg/info_old/
sudo mkdir /var/lib/dpkg/info/

Now info is a brand-new, empty folder, and apt-get -f install succeeded.
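
Why the first attempt removed nothing: the pattern *xxx* in the earlier rm command looks like a placeholder carried over from the suggestion, and a glob like that only matches names containing the literal text "xxx". The sketch below reproduces both approaches on a scratch directory (nothing under /var/lib/dpkg is touched); the file names are invented for the demonstration.

```shell
# Demonstration on a throwaway directory only.
demo_root="$(mktemp -d)"
mkdir "$demo_root/info"
touch "$demo_root/info/a.list" "$demo_root/info/b.postinst" "$demo_root/info/c.md5sums"

# Count everything below a directory.
count_entries() ( find "$1" -mindepth 1 | wc -l | tr -d ' ' )

# The glob *xxx* matches no file here, and with -f rm stays silent about it.
rm -rf "$demo_root"/info/*xxx*
after_glob="$(count_entries "$demo_root/info")"

# Moving the directory aside and recreating it really does give an empty one,
# while the old contents survive in info_old as a backup.
mv "$demo_root/info" "$demo_root/info_old"
mkdir "$demo_root/info"
after_move="$(count_entries "$demo_root/info")"

echo "after glob rm: $after_glob entries; after mv+mkdir: $after_move entries"
```

The move-and-recreate variant also keeps the old metadata around, which is useful if it ever needs to be restored.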

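Re-running the remove command from step (1) now succeeded.

sudo apt-get autoremove succeeded as well.

Check whether the uninstall was clean with the following command; no output at all means everything was removed: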

dpkg --list | grep -i nvidia
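
A slightly more informative variant of that check is sketched below: it keeps the dpkg state column, so leftovers in state "rc" (package removed, configuration files still present) can be told apart from packages that are still installed ("ii"). nvidia_package_states is a hypothetical helper name.

```shell
# Hypothetical helper: reads `dpkg --list` output on stdin and prints
# "<state> <package>" for every package whose name starts with "nvidia".
nvidia_package_states() (
  awk '$2 ~ /^nvidia/ { print $1, $2 }'
)
```

Usage: dpkg --list | nvidia_package_states. Empty output means no matching package is known to dpkg.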

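At this point the old driver has been removed, and you can already go back to step 2 and let the CUDA installer install the new driver along the way!

What follows is my attempt to install the driver separately:

(3) Install the GPU driver from a PPA (requires network access):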

sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update

sudo apt-get update complained about invalid signatures:

W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.2 Release: The following signatures were invalid: KEYEXPIRED 1681764390
W: GPG error: http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 Release: The following signatures were invalid: KEYEXPIRED 1578250443
W: The repository 'http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 Release' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: Failed to fetch https://repo.mongodb.org/apt/ubuntu/dists/xenial/mongodb-org/4.2/Release.gpg  The following signatures were invalid: KEYEXPIRED 1681764390
W: Some index files failed to download. They have been ignored, or old ones used instead.

Suggestion found by searching:

curl -s https://repo.mongodb.org/apt/ubuntu/dists/xenial/mongodb-org/4.2/Release.gpg | sudo apt-key add -

Error: gpg: no valid OpenPGP data found.

Another suggestion was to split this into two steps: download the file locally first, then add it as a key:

curl -O https://repo.mongodb.org/apt/ubuntu/dists/xenial/mongodb-org/4.2/Release.gpg
sudo apt-key add Release.gpg

The same error appeared again.
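
Two observations, both hedged. The warnings all name repo.mongodb.org rather than the graphics-drivers PPA, so they look unrelated to the driver packages. And as far as I understand, Release.gpg is a detached signature for the Release file rather than a public key, which would explain why apt-key add rejects it. The sketch below lists which repositories have expired keys, assuming the warning format shown above; expired_key_repos is a hypothetical helper.

```shell
# Hypothetical helper: reads `apt-get update` output on stdin and prints
# "<repo url> <suite/component> <key expiry as Unix timestamp>" once per
# repository whose signing key has expired.
expired_key_repos() (
  grep 'GPG error:' | grep 'KEYEXPIRED' |
    sed -E 's#.*GPG error: (https?://[^ ]+ [^ ]+) Release:.*KEYEXPIRED ([0-9]+).*#\1 \2#' |
    sort -u
)
```

Usage: sudo apt-get update 2>&1 | expired_key_repos. With GNU date, a timestamp from the last column can be made readable with date -u -d @<timestamp>.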

I tried moving on to the next step anyway: look up the newest available driver version

sudo apt-cache search nvidia

Many lines came back, but nothing at or above 455.23... the highest version was only 418.
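
Scanning that long list by eye is error-prone. As a rough sketch, the highest driver series offered by the configured repositories can be extracted as below. highest_nvidia_series is a made-up helper, and it assumes driver packages are named nvidia-<number> or nvidia-driver-<number>.

```shell
# Hypothetical helper: reads `apt-cache search nvidia` output on stdin and
# prints the largest driver series number found in the package names.
highest_nvidia_series() (
  grep -oE '^nvidia-(driver-)?[0-9]+' | grep -oE '[0-9]+$' | sort -n | tail -n 1
)
```

Usage: apt-cache search nvidia | highest_nvidia_series.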

I tried naming the package explicitly:

sudo apt-get install nvidia-driver-455-server

unable to locate...

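Fine, a different approach then: download the driver manually and install it.

(1) Find the GPU model (though you probably know it already without looking it up)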

lspci | grep -i vga

This gave 4 lines:

19:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1) 

Copy the hexadecimal ID 1e04 from the first line into the search box at the link below and click Jump. The result: my GPU is a 2080 Ti.
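
With several cards in one machine, it helps to confirm that they all share the same device ID before looking it up. A small sketch, assuming lspci prints "NVIDIA Corporation Device <id>" as in the output above; nvidia_device_ids is a hypothetical helper.

```shell
# Hypothetical helper: reads lspci output on stdin and prints each distinct
# NVIDIA device ID followed by how many times it occurs.
nvidia_device_ids() (
  sed -nE 's/.*NVIDIA Corporation Device ([0-9a-fA-F]{4}).*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
)
```

Usage: lspci | grep -i vga | nvidia_device_ids. Running lspci -nn instead also prints the numeric vendor:device pair in brackets, which works even when lspci already knows the card's name.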

PCI devices

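(2) Find a suitable version on the official site:

From this step on I followed the CSDN blog post 【Ubuntu】16.04服务器:驱动更新+cuda11+cudnn by 摇曳的树.

On the official driver page, NVIDIA GeForce 驱动程序 - N 卡驱动 | NVIDIA, fill in your GPU and system details. Notably, that blogger offers a way to list up to 200 older versions: open the browser's developer tools, paste the following into the console, and only then click the search button: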

SystemScanner.prototype.DriverSearch = function(psid, pfid, osID, langCode, whql, beta, dltype, numresults ) {numresults=200;this.scannerStatusUpdate(GFE_SERVER_CONNECTING);theScanner.scannedDevice.downloadInfo=new Object();var parameters='psid='+psid;parameters+='&pfid='+pfid;parameters+='&osID='+osID;parameters+='&languageCode='+langCode;parameters+='&beta='+beta;parameters+='&isWHQL='+whql;parameters+="&dltype="+dltype;parameters+="&sort1=0";parameters+="&numberOfResults="+numresults;var requestUrl=this.driverManualLookupUrl+parameters;this.driversLogUIEvent("warn","SUID:"+this.tracker.scanID+" BEGIN DriverSearch requestUrl:"+requestUrl);this.debugTrace(requestUrl);jQuery.ajax({url:requestUrl,async:false,type:'get',success:function(response){try{theScanner.debugTrace("The Driver Lookup Service Returned:\n\n("+response+")");if(response.length>0){theScanner.resetResults();var driverLookupJsonObj='('+response+')';theScanner.resultsList=new Object();theScanner.resultsList=eval(driverLookupJsonObj)}if(theScanner.resultsList.Success==0){theScanner.scannerStatus="No driver available"}else{theScanner.scannerStatus="Results Ready"}}catch(e){this.driversLogUIEvent("error"," FAIL catch DriverSearch");theScanner.resetResults();theScanner.scannerStatus="No driver available"}},error:function(response){theScanner.resetResults();theScanner.scannerStatus="AJAX Call failed"}});this.driversLogUIEvent("warn","SUID:"+this.tracker.scanID+" END DriverSearch requestUrl:"+requestUrl);}

After downloading a suitable version, make sure the old driver is completely removed, then run:

sudo bash NVIDIA-Linux-x86_64-470.57.02.run -no-opengl-files -no-x-check

However, installing NVIDIA-Linux-x86_64-455.45.01.run this way ran into an error:

Unable to load the “nvidia-drm” kernel module
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

One fix found by searching:

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

This showed that some modules were blacklisted. Delete those blacklist files or rename them to .bak:

cd /etc/modprobe.d/
sudo mv blacklist-nvidia.conf blacklist-nvidia.conf.bak
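
The plain grep above also matches comments and unrelated options. A narrower sketch that reports only active blacklist lines for nvidia modules is shown below; find_nvidia_blacklists is a hypothetical helper, and which files to pass in is up to the caller.

```shell
# Hypothetical helper: print "<file>:<line>" for every active (uncommented)
# blacklist entry that targets a module whose name starts with "nvidia".
find_nvidia_blacklists() (
  # -s hides "no such file" noise; -H always prints the file name.
  grep -sHE '^[[:space:]]*blacklist[[:space:]]+nvidia' "$@"
)
```

Usage: find_nvidia_blacklists /etc/modprobe.d/*.conf /lib/modprobe.d/*.conf. A non-zero exit status with no output means no such entry was found.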

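After rebooting and installing again, the same error came back. The next suggested fix: disable the Secure Boot option in the BIOS.

Then the most absurd thing of all happened: the BIOS of this GIGABYTE machine has no Secure Boot option at all. I could not keep a straight face!

The only option left seemed to be updating the BIOS... but that takes time and a trip to the server room, which is a hassle.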
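
Before scheduling a BIOS update, it may be worth checking from the running system whether Secure Boot can be involved at all. The sketch below rests on two assumptions: a missing /sys/firmware/efi directory is taken as a sign of a legacy BIOS boot, where Secure Boot does not apply, and otherwise the answer is delegated to mokutil --sb-state, if the mokutil package happens to be installed. secure_boot_hint is a made-up name.

```shell
# Hypothetical helper: give a hint about whether Secure Boot can be active.
# The optional argument overrides the EFI directory (useful for testing).
secure_boot_hint() (
  efi_dir="${1:-/sys/firmware/efi}"
  if [ ! -d "$efi_dir" ]; then
    echo "no $efi_dir: booted in legacy BIOS mode, Secure Boot does not apply"
  elif command -v mokutil >/dev/null 2>&1; then
    # mokutil reports whether Secure Boot is enabled or disabled.
    mokutil --sb-state
  else
    echo "UEFI boot detected; install mokutil to query the Secure Boot state"
  fi
)
```

If the hint says Secure Boot is not in play, the nvidia-drm failure probably has another cause, and /var/log/nvidia-installer.log is the next place to look.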

For the time being I restored the original 418 driver with the command suggested in the earlier error log:

sudo apt-get install nvidia-418 nvidia-modprobe nvidia-settings

Wait, a sudden flash of inspiration: what if I remove the driver and let the CUDA installer install its bundled driver?

The command to remove the driver:

sudo apt-get --purge remove 'nvidia-*'

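Then I went back to the CUDA installation in step 2 and, wow, it actually worked! The driver was updated to 455.23.05, the minimum driver version that CUDA 11.1 depends on.

4. Install cuDNN:

Download cuDNN from the official site: cuDNN Archive | NVIDIA Developer

Choose the version that matches your system and CUDA version. I simply copied the blog mentioned above and downloaded cudnn-11.3-linux-x64-v8.2.1.32.tgz, then extracted it and copied the files into the CUDA directory: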

tar zxvf cudnn-11.3-linux-x64-v8.2.1.32.tgz
sudo cp cuda/include/cudnn*.h /usr/local/cuda-11.1/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.1/lib64
sudo chmod a+r /usr/local/cuda-11.1/include/cudnn*.h /usr/local/cuda-11.1/lib64/libcudnn*
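
To double-check which cuDNN version was unpacked, the version macros can be read from the header. This sketch assumes a cuDNN 8 archive, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL live in cudnn_version.h; cudnn_version is a hypothetical helper that takes the include directory as its argument.

```shell
# Hypothetical helper: print "<major>.<minor>.<patch>" as defined in
# cudnn_version.h inside the given include directory
# (default: the extracted archive's cuda/include).
cudnn_version() (
  header="${1:-cuda/include}/cudnn_version.h"
  [ -r "$header" ] || { echo "cannot read $header" >&2; exit 1; }
  awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
       $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
       $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
       END { print major "." minor "." patch }' "$header"
)
```

Usage: cudnn_version cuda/include, run from the directory where the archive was extracted.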

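5. Finish mapping the CUDA paths:

To switch between CUDA versions, make the environment variables and the nvcc wrapper refer only to /usr/local/cuda, then use a symlink to point /usr/local/cuda at either cuda-10.1 or cuda-11.1.

(1) Edit the environment variables (in vim, press Esc and type :wq to save and quit):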

vim ~/.bashrc

At the very end of the file, change the existing entries to:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/usr/local/cuda/lib64"
export PATH=$PATH:"/usr/local/cuda/bin"
export CUDA_HOME="/usr/local/cuda"
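
Appending unconditionally means every nested shell that sources ~/.bashrc adds the same directory again. One possible refinement, not something the original setup used, is to append a directory only when it is missing. append_once is a hypothetical helper.

```shell
# Hypothetical helper: print the colon-separated list $1 with directory $2
# appended, unless $2 is already one of its entries.
append_once() (
  case ":$1:" in
    *":$2:"*) printf '%s\n' "$1" ;;          # already present: unchanged
    *)        printf '%s\n' "${1:+$1:}$2" ;; # add, avoiding a leading colon
  esac
)
```

Usage in ~/.bashrc: export PATH="$(append_once "$PATH" /usr/local/cuda/bin)", and likewise for LD_LIBRARY_PATH with /usr/local/cuda/lib64.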

(2) Because nvcc -V reported the wrong CUDA version for me, I also edited the nvcc file:

sudo vim /usr/bin/nvcc

Change the corresponding line to:

exec /usr/local/cuda/bin/nvcc "$@"

After that, nvcc -V shows that it now maps to the new CUDA version.
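
A likely reason for the wrong version (an assumption, not something I verified) is that /usr/local/cuda/bin sits at the end of PATH, so an older /usr/bin/nvcc is found first. The sketch below lists every match in search order; list_candidates is a hypothetical helper.

```shell
# Hypothetical helper: print every executable named $1 found in the
# colon-separated directory list $2 (default: $PATH), in search order.
list_candidates() (
  set -f    # no glob expansion while splitting the list
  IFS=:
  for dir in ${2:-$PATH}; do
    if [ -x "$dir/$1" ]; then
      printf '%s\n' "$dir/$1"
    fi
  done
)
```

Usage: list_candidates nvcc. The first line printed is the one the shell will run; in bash, type -a nvcc gives a similar answer.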

(3) Inspect the symlink:

stat /usr/local/cuda

(4) Switch the symlink:

Remove the link to 11.1 and switch back to 10.1:

sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
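
The two commands can be folded into one small function so that switching is a single call. This is only a sketch: switch_cuda is a made-up name, it assumes toolkits are installed as <prefix>/cuda-<version>, and it relies on GNU ln supporting -n.

```shell
# Hypothetical helper: point <prefix>/cuda at <prefix>/cuda-<version>.
# $1 = version (e.g. 10.1), $2 = install prefix (default /usr/local).
switch_cuda() (
  prefix="${2:-/usr/local}"
  target="$prefix/cuda-$1"
  [ -d "$target" ] || { echo "no such toolkit: $target" >&2; exit 1; }
  # -s: symbolic link, -f: replace an existing link,
  # -n: treat the existing link as a file instead of following it.
  ln -sfn "$target" "$prefix/cuda" && readlink "$prefix/cuda"
)
```

Usage: switch_cuda 11.1 or switch_cuda 10.1. Writing under /usr/local needs root, so the function has to be defined and called in a root shell; refusing to link to a missing directory avoids leaving /usr/local/cuda dangling.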
