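CUDA & Fabric Manager Installation Guide
Summary: The driver, Fabric Manager, and CUDA component versions must match exactly (both major and minor version). In particular, nvidia-fabricmanager must match the driver version exactly, so check the driver version with nvidia-smi first and then download and install the corresponding packages. The installation order is as follows:
1. CUDA
On this system, uname -a shows ky10.x86_64
# Install CUDA on Kylin V10. The NVIDIA driver version is currently 535, which is paired with CUDA 12.2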
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-kylin10-12-2-local-12.2.0_535.54.03-1.x86_64.rpm
# After the download finishes, run:
sudo rpm -i cuda-repo-kylin10-12-2-local-12.2.0_535.54.03-1.x86_64.rpm
sudo dnf clean all
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda
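# When all commands have finished, the output ends with "Complete!"
# Add the environment variables (also append the two export lines to ~/.bashrc so that they persist across sessions)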
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
source ~/.bashrc
nvcc --version
If the version information is printed normally, the CUDA installation is complete.
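The export lines above only affect the current shell unless they are also written to ~/.bashrc. The following is a minimal sketch of one way to persist them without creating duplicates on repeated runs. The helper names add_line_once and persist_cuda_env are hypothetical and are not part of any NVIDIA tooling; the paths assume the default /usr/local/cuda install location used above.

```shell
# Append a line to a file only if that exact line is not already present.
# (add_line_once is a hypothetical helper name used for illustration.)
add_line_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Write both CUDA environment lines to the given shell startup file.
# Assumes CUDA lives in /usr/local/cuda, as in the steps above.
persist_cuda_env() {
    rc_file="$1"
    add_line_once 'export PATH=/usr/local/cuda/bin:$PATH' "$rc_file"
    add_line_once 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' "$rc_file"
}
```

A typical call would be persist_cuda_env ~/.bashrc followed by source ~/.bashrc. Running it a second time leaves the file unchanged.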
2. Fabric Manager
# Check the version installed on this machine
rpm -qa | grep fabric-manager
# The GPU driver is 535.54.03. The installed Fabric Manager version does not match, so it must be removed
sudo yum remove nvidia-fabric-manager-535.129.03-1.x86_64
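The comparison between the driver version and the installed Fabric Manager package can also be scripted. The sketch below is only an illustration: the function names fm_version_from_pkg and versions_match are hypothetical, and the parsing assumes the package name follows the nvidia-fabric-manager-<version>-<release>.<arch> pattern seen above. If the name does not follow that pattern, the input is returned unchanged.

```shell
# Extract the version from an RPM package name, for example
# nvidia-fabric-manager-535.129.03-1.x86_64 gives 535.129.03
# (hypothetical helper; assumes the naming pattern shown above).
fm_version_from_pkg() {
    printf '%s\n' "$1" |
        sed -E 's/^nvidia-fabric-manager-([0-9]+(\.[0-9]+)*)-.*$/\1/'
}

# Return 0 when the two version strings are identical, 1 otherwise.
versions_match() {
    [ "$1" = "$2" ]
}

# Usage on the GPU host (requires nvidia-smi and rpm):
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   pkg=$(rpm -qa | grep fabric-manager | head -n 1)
#   versions_match "$driver" "$(fm_version_from_pkg "$pkg")" || echo "version mismatch"
```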
# Configure the NVIDIA repository (using the RHEL 8 path, which is compatible with Kylin V10)
sudo tee /etc/yum.repos.d/cuda.repo <<EOF
[cuda]
name=CUDA Repository
baseurl=https://developer.download.nvidia.cn/compute/cuda/repos/rhel8/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.cn/compute/cuda/repos/rhel8/x86_64/D42D0685.pub
EOF
# Clean the cache and install the matching version
sudo yum clean all
sudo yum install nvidia-fabric-manager-535.54.03
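# Install from a locally downloaded RPM package (skip this step if the yum install above already succeeded)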
sudo rpm -ivh nvidia-fabric-manager-535.54.03-1.x86_64.rpm
# Start the service and verify it
sudo systemctl start nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager # should show "active (running)"
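For unattended checks, the service state can be read with systemctl is-active instead of inspecting the status output by eye. The sketch below assumes the service name nvidia-fabricmanager used above; fm_state_ok is a hypothetical helper name. Enabling the service at boot is an optional extra step that the original procedure does not include.

```shell
# Interpret the one-word state printed by `systemctl is-active`.
# (fm_state_ok is a hypothetical helper name used for illustration.)
fm_state_ok() {
    case "$1" in
        active)
            echo "nvidia-fabricmanager is running"
            return 0
            ;;
        *)
            echo "nvidia-fabricmanager is not running (state: ${1:-unknown})"
            return 1
            ;;
    esac
}

# Usage on the GPU host:
#   sudo systemctl enable nvidia-fabricmanager   # optional: also start at boot
#   fm_state_ok "$(systemctl is-active nvidia-fabricmanager)"
```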