Installing the NVIDIA Driver and CUDA Environment on a GPU Cloud Host

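A GPU cloud host was provisioned for the test environment. This post walks through setting it up from scratch for general use and application testing.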

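Hardware environment

| Type      | Spec                              | Model                                                     |
|-----------|-----------------------------------|-----------------------------------------------------------|
| CPU       | 152 cores                         | Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80GHz * 2         |
| Mem       | 1 TB                              | Samsung M393A4K40EB3-CWE 32GB * 32                        |
| OS Disk   | 480 GB (running on a single disk) | Samsung MZ7L3480HCHQ-00B7C 480GB * 2                      |
| Data Disk | NVMe 1.8 TB * 4 (not yet in use)  | Samsung MZQL21T9HCJR-00B7C 1.8T * 4                       |
| RAID Card |                                   | LSI SAS9311-8i                                            |
| Net       | Bond1 50 Gb/s (mode=4)            | Mellanox Technologies MT27710 Family [ConnectX-4 Lx]      |
| Net       | Bond0 10 Gb/s (mode=1)            | Intel Corporation Ethernet Controller X710 for 10GbE SFP+ |
| GPU       | 320 GB (40 GB * 8)                | NVIDIA GA100 [A800 SXM4 40GB]                             |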

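View the CPU: lscpu or dmidecode -t processor
View the memory: dmidecode -t memory
View the disks: lsblk -d -o NAME,MODEL,SIZE,TRAN
View the RAID card: use lspci to find the device model in the slot, then install the matching RAID management tool
View the NICs: lspci | grep -i ethernet, or ethtool
View the GPUs: run lspci and look for the NVIDIA entries
  Example: lspci reports "3D controller: NVIDIA Corporation Device 20bd"
  When only a device ID is shown, the model can be looked up on sites such as the PCI ID Repository.
  To look it up:
        Visit https://admin.pci-ids.ucw.cz/read/PC/
        Enter the vendor ID and device ID on the site.
            The vendor ID is 10de (NVIDIA Corporation).
            The device ID is 20bd.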
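
The vendor and device IDs needed for that lookup can be read directly from `lspci -nn`, which prints them as a `[vendor:device]` pair at the end of each device description. The following is a minimal sketch, not a definitive tool: `extract_pci_id` is a hypothetical helper name, and the bus address and revision in the sample line are illustrative only.

```shell
# Extract the "vendor:device" pair (e.g. 10de:20bd) from one line of `lspci -nn` output.
# extract_pci_id is a hypothetical helper, not part of pciutils.
extract_pci_id() {
    # The ID pair is the last [xxxx:xxxx] bracket group on the line;
    # the class code (e.g. [0302]) has no colon and is not matched.
    printf '%s\n' "$1" | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tail -n 1 | tr -d '[]'
}

# Sample line in the format `lspci -nn` prints (bus address and revision are illustrative).
sample='17:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20bd] (rev a1)'
extract_pci_id "$sample"
```

On a real host the input line would come from `lspci -nn | grep -i nvidia` instead of the sample string.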

Operating system: CentOS 7.9

[root@bms-38735070 ~]# cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@bms-38735070 ~]# cat /proc/version 
Linux version 3.10.0-1160.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Mon Oct 19 16:18:59 UTC 2020
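
Note that /etc/os-release reports only the major version (7). On CentOS the minor release is normally recorded in /etc/centos-release. A small sketch for reading both, where `os_release_value` is a hypothetical helper name:

```shell
# Print the value of KEY from an os-release style file, without surrounding quotes.
# os_release_value is a hypothetical helper: os_release_value KEY FILE
os_release_value() {
    sed -n "s/^$1=//p" "$2" | tr -d '"'
}

# Major version from os-release, if the file exists on this machine.
if [ -r /etc/os-release ]; then
    os_release_value VERSION_ID /etc/os-release
fi

# Full release string (including the minor version) on CentOS hosts.
if [ -r /etc/centos-release ]; then
    cat /etc/centos-release
fi
```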

Configure the Aliyun yum mirror

Back up the original repo files
cd /etc/yum.repos.d
mkdir backup
mv *.repo backup

Download the Aliyun repo files
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo

Clear the old cache and build a new one for the new repos
yum clean all
yum makecache
yum repolist
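
If the backup step is run more than once, a fixed `backup` directory can end up overwriting earlier copies. One possible variation is to move the files into a timestamped directory. This is only a sketch: `backup_repos` is a hypothetical helper, and the example exercises it on a scratch directory rather than on /etc/yum.repos.d.

```shell
# Move every *.repo file in the given directory into a dated backup subdirectory
# and print the backup path. backup_repos is a hypothetical helper, not part of yum.
backup_repos() {
    repo_dir=$1
    backup_dir="$repo_dir/backup-$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    for f in "$repo_dir"/*.repo; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        mv "$f" "$backup_dir"/ || return 1
    done
    printf '%s\n' "$backup_dir"
}

# Dry run against a scratch directory instead of /etc/yum.repos.d.
scratch=$(mktemp -d)
touch "$scratch/CentOS-Base.repo" "$scratch/epel.repo"
saved=$(backup_repos "$scratch")
ls "$saved"
```

On the host itself the call would be `backup_repos /etc/yum.repos.d`, run as root before downloading the new repo files.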

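Install the GPU driver

Download the driver from the official NVIDIA driver download page.

Upload the file to the /usr/local/src directory on the host.
First install the packages that the driver installer depends on: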
# yum install kernel-devel kernel-headers gcc make dkms

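Disable the Nouveau driver. This is the open-source NVIDIA driver that ships with Linux, and it must be disabled before the official NVIDIA driver is installed.
Create a configuration file that blacklists Nouveau: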
# bash -c 'echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf'
# bash -c 'echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf'
Regenerate the initramfs and reboot:
# dracut --force
# reboot
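
After the reboot, it may be worth confirming that nouveau is really no longer loaded before running the NVIDIA installer. A minimal sketch, assuming `lsmod` is available; `nouveau_loaded` is a hypothetical helper name:

```shell
# Succeed (exit 0) if a module named nouveau appears in lsmod-style text on stdin.
# nouveau_loaded is a hypothetical helper; it only inspects the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_loaded; then
        echo "nouveau is still loaded; recheck the blacklist file and the initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```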

Run the driver installer
# cd /usr/local/src/
# chmod u+x NVIDIA-Linux-x86_64-535.183.06.run 
# ./NVIDIA-Linux-x86_64-535.183.06.run
At this point the installer fails with the following error:
ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
Check which kernel-devel packages are installed:
# rpm -qa | grep kernel
    kernel-tools-libs-3.10.0-1160.el7.x86_64
    kernel-3.10.0-1160.el7.x86_64
    kernel-debug-devel-3.10.0-1160.119.1.el7.x86_64
    kernel-devel-3.10.0-1160.119.1.el7.x86_64
    kernel-tools-3.10.0-1160.el7.x86_64
    kernel-headers-3.10.0-1160.119.1.el7.x86_64

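The installed kernel-devel package (3.10.0-1160.119.1.el7) does not match the running kernel (3.10.0-1160.el7), so a kernel-devel package with the same version suffix as the running kernel must also be installed: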
# yum install "kernel-devel-uname-r == $(uname -r)"
Check again:
# rpm -qa |grep kernel
    kernel-tools-libs-3.10.0-1160.el7.x86_64
    kernel-3.10.0-1160.el7.x86_64
    kernel-debug-devel-3.10.0-1160.119.1.el7.x86_64
    kernel-devel-3.10.0-1160.119.1.el7.x86_64
    kernel-devel-3.10.0-1160.el7.x86_64
    kernel-tools-3.10.0-1160.el7.x86_64
    kernel-headers-3.10.0-1160.119.1.el7.x86_64
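
The mismatch can also be checked mechanically by comparing the running kernel release with the installed package names. The sketch below uses the package list from the first check as sample input; `has_matching_kernel_devel` is a hypothetical helper name.

```shell
# Succeed if a kernel-devel package for the given kernel release appears in the
# package list on stdin (one package name per line, as `rpm -qa` prints them).
# has_matching_kernel_devel is a hypothetical helper.
has_matching_kernel_devel() {
    grep -qxF "kernel-devel-$1"
}

# Package list from the first `rpm -qa | grep kernel` check.
before='kernel-tools-libs-3.10.0-1160.el7.x86_64
kernel-3.10.0-1160.el7.x86_64
kernel-debug-devel-3.10.0-1160.119.1.el7.x86_64
kernel-devel-3.10.0-1160.119.1.el7.x86_64
kernel-tools-3.10.0-1160.el7.x86_64
kernel-headers-3.10.0-1160.119.1.el7.x86_64'

running='3.10.0-1160.el7.x86_64'   # what `uname -r` reports on this host
if printf '%s\n' "$before" | has_matching_kernel_devel "$running"; then
    echo "kernel-devel matches $running"
else
    echo "kernel-devel for $running is missing"
fi
```

On a live host the same check would be `rpm -qa | has_matching_kernel_devel "$(uname -r)"`.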

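Run the installer again, accepting the default at each prompt, then confirm that the driver works with nvidia-smi: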
# ./NVIDIA-Linux-x86_64-535.183.06.run
# nvidia-smi
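
Beyond reading the nvidia-smi table by eye, the GPU count can be checked against the hardware list (8 x A800 40GB). This is a sketch under that assumption; `count_gpus` is a hypothetical helper, while `--query-gpu` and `--format=csv,noheader` are standard nvidia-smi options.

```shell
# Count non-empty lines on stdin, i.e. one line per GPU in
# `nvidia-smi --query-gpu=... --format=csv,noheader` output.
# count_gpus is a hypothetical helper.
count_gpus() {
    awk 'NF { n++ } END { print n + 0 }'
}

expected=8   # from the hardware list: 8 x A800 40GB
if command -v nvidia-smi >/dev/null 2>&1; then
    found=$(nvidia-smi --query-gpu=index,name,memory.total,driver_version \
                       --format=csv,noheader | count_gpus)
    if [ "$found" -eq "$expected" ]; then
        echo "all $expected GPUs are visible"
    else
        echo "expected $expected GPUs, but the driver reports $found"
    fi
else
    echo "nvidia-smi not found; the driver is not installed on this machine"
fi
```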

Install CUDA

Download CUDA from the following link:

https://developer.nvidia.com/cuda-12-2-0-download-archive

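Install by following the instructions on that page. The runfile also bundles a driver (535.54.03, as its file name shows); because driver 535.183.06 is already installed, deselect the driver component in the installer menu.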

# wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
# sh cuda_12.2.0_535.54.03_linux.run

After the installation finishes, add the CUDA paths to the environment variables:

# cat << 'EOF' >> /root/.bashrc
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
EOF

# source /root/.bashrc
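
The `${VAR:+:${VAR}}` form in those exports appends a colon and the old value only when the variable is already set and non-empty. This avoids a trailing colon, which the shell and the dynamic linker would treat as an empty entry meaning the current directory. A small sketch of the behaviour, where `prepend_dir` is a hypothetical helper name:

```shell
# prepend_dir DIR VALUE: print DIR, followed by ":VALUE" only when VALUE is non-empty,
# mirroring the ${VAR:+:${VAR}} expansion used in the exports.
# prepend_dir is a hypothetical helper for illustration.
prepend_dir() {
    dir=$1
    value=$2
    printf '%s\n' "${dir}${value:+:${value}}"
}

prepend_dir /usr/local/cuda/lib64 ""            # prints /usr/local/cuda/lib64
prepend_dir /usr/local/cuda/bin /usr/bin:/bin   # prints /usr/local/cuda/bin:/usr/bin:/bin
```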

Check the CUDA version
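
One common way to do this, assuming the toolkit's bin directory is on PATH as configured in the previous step, is `nvcc --version`. The sketch below extracts just the release number from its output; `cuda_release` is a hypothetical helper name.

```shell
# Pull the release number (major.minor) out of `nvcc --version` output on stdin.
# cuda_release is a hypothetical helper; it looks for the "release X.Y" token.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | cuda_release
else
    echo "nvcc not found; check that /usr/local/cuda/bin is on PATH"
fi
```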
