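A GPU cloud host was allocated for the test environment. These notes cover setting the host up from scratch and testing applications on it.
Hardware environment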
Type | Spec | Model |
CPU | 152 core | Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80GHz * 2 |
Mem | 1 TB | Samsung M393A4K40EB3-CWE 32GB * 32 |
OS Disk | 480 GB (running on a single disk) | Samsung MZ7L3480HCHQ-00B7C 480GB * 2 |
Data Disk | NVMe 1.8 TB * 4 (not yet in use) | Samsung MZQL21T9HCJR-00B7C 1.8T * 4 |
RAID Card | LSI SAS9311-8i | |
Net | Bond1 50 Gb/s (mode=4), Bond0 10 Gb/s (mode=1) | Mellanox Technologies MT27710 Family [ConnectX-4 Lx]; Intel Corporation Ethernet Controller X710 for 10GbE SFP+ |
GPU | 320 GB (40 GB * 8) | NVIDIA GA100 [A800 SXM4 40GB] |
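Check the CPU: lscpu or dmidecode -t processor
Check the memory: dmidecode -t memory
Check the disks: lsblk -d -o NAME,MODEL,SIZE,TRAN
Check the RAID card: use lspci to find the device model in the slot, then install the matching RAID management tool
Check the NICs: lspci | grep -i ethernet or ethtool
Check the GPUs: use lspci and look for the NVIDIA entries
Example: lspci reports "3D controller: NVIDIA Corporation Device 20bd"
When lspci shows only a device ID like this, the details can be looked up on a site such as the PCI ID Repository.
The lookup steps are:
Visit https://admin.pci-ids.ucw.cz/read/PC/
Enter the vendor ID and the device ID on the site.
The vendor ID is 10de (NVIDIA Corporation).
The device ID is 20bd.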
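The vendor and device IDs needed for that lookup can be read straight from lspci: with the -nn option it prints the numeric IDs in square brackets next to each device name. The sketch below assumes that output format; extract_pci_ids is a hypothetical helper name, not part of pciutils.

```shell
# Hypothetical helper: pull every "vendor:device" ID pair out of `lspci -nn` text on stdin.
# The 4-hex-digit class code (e.g. [0302]) has no colon, so it is not matched.
extract_pci_ids() {
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]'
}

# Usage on the host (prints one ID pair per NVIDIA device):
#   lspci -nn | grep -i nvidia | extract_pci_ids
```

For the device in the example above this should yield 10de:20bd, which splits into the vendor ID and device ID to enter on the site.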
Operating system version: CentOS 7.9
[root@bms-38735070 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@bms-38735070 ~]# cat /proc/version
Linux version 3.10.0-1160.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Mon Oct 19 16:18:59 UTC 2020
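Note that /etc/os-release reports only the major version (VERSION_ID="7"). On CentOS 7 the minor release is normally recorded in /etc/centos-release. For scripting, a single field can be pulled out of an os-release style file as sketched below. os_release_field is a hypothetical helper name, and it assumes the simple KEY="value" layout shown above.

```shell
# Hypothetical helper: print the value of one KEY from an os-release style file.
# $1 = key name, $2 = file path (defaults to /etc/os-release).
os_release_field() {
    key="$1"
    file="${2:-/etc/os-release}"
    # Keep only the line that starts with KEY=, drop the key, then strip the quotes.
    sed -n "s/^${key}=//p" "$file" | tr -d '"'
}

# Usage on the host:
#   os_release_field VERSION_ID
#   cat /etc/centos-release
```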
Configure the Aliyun yum repositories
Back up the original files
cd /etc/yum.repos.d
mkdir backup
mv *.repo backup
Download the Aliyun yum repository files
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
Clear the cache and build a new yum cache
yum clean all
yum makecache
yum repolist
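If the backup step is ever rerun, a plain mkdir backup fails because the directory already exists. A minimal sketch of a repeatable version follows. backup_repos is a hypothetical helper name, and the default directory matches the one used above.

```shell
# Hypothetical helper: move all *.repo files in a directory into its backup/ subdirectory.
# $1 = repo directory (defaults to /etc/yum.repos.d). Safe to run more than once.
backup_repos() {
    repo_dir="${1:-/etc/yum.repos.d}"
    mkdir -p "$repo_dir/backup" || return 1
    for f in "$repo_dir"/*.repo; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv "$f" "$repo_dir/backup/"
    done
}

# Usage on the host, before downloading the new repo files:
#   backup_repos /etc/yum.repos.d
```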
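Install the GPU driver
Download the GPU driver from NVIDIA's official driver download page.
Upload the file to /usr/local/src on the system
First install the packages the driver installation requires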
# yum install kernel-devel kernel-headers gcc make dkms
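Disable the Nouveau driver (the open-source NVIDIA driver that ships with Linux; it must be disabled before the official NVIDIA driver is installed).
Create a configuration file that disables Nouveau: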
# bash -c 'echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf'
# bash -c 'echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf'
Regenerate the initramfs:
# dracut --force
# reboot
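After the reboot it is worth confirming that the nouveau module is no longer loaded before starting the installer. The sketch below checks lsmod style output. module_loaded is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the named module appears in `lsmod` text on stdin.
# $1 = module name; the first column of lsmod output is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on the host:
#   if lsmod | module_loaded nouveau; then
#       echo "nouveau is still loaded, recheck the blacklist file and the initramfs"
#   fi
```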
Install the driver
# cd /usr/local/src/
# chmod u+x NVIDIA-Linux-x86_64-535.183.06.run
# ./NVIDIA-Linux-x86_64-535.183.06.run
At this point the following error appears
ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
Check which kernel packages, including kernel-devel, are installed
# rpm -qa | grep kernel
kernel-tools-libs-3.10.0-1160.el7.x86_64
kernel-3.10.0-1160.el7.x86_64
kernel-debug-devel-3.10.0-1160.119.1.el7.x86_64
kernel-devel-3.10.0-1160.119.1.el7.x86_64
kernel-tools-3.10.0-1160.el7.x86_64
kernel-headers-3.10.0-1160.119.1.el7.x86_64
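The installed kernel-devel package (3.10.0-1160.119.1) does not match the running kernel (3.10.0-1160), so a kernel-devel package with the same version suffix as the running kernel must also be installed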
# yum install "kernel-devel-uname-r == $(uname -r)"
Check again
# rpm -qa |grep kernel
kernel-tools-libs-3.10.0-1160.el7.x86_64
kernel-3.10.0-1160.el7.x86_64
kernel-debug-devel-3.10.0-1160.119.1.el7.x86_64
kernel-devel-3.10.0-1160.119.1.el7.x86_64
kernel-devel-3.10.0-1160.el7.x86_64
kernel-tools-3.10.0-1160.el7.x86_64
kernel-headers-3.10.0-1160.119.1.el7.x86_64
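The mismatch can be detected before running the installer by comparing the output of uname -r with the installed package list. A minimal sketch follows. has_matching_kernel_devel is a hypothetical helper name, and it assumes the package naming shown in the rpm -qa output above.

```shell
# Hypothetical helper: succeed if a kernel-devel package for the given kernel release
# appears in `rpm -qa` text on stdin.
# $1 = kernel release as printed by `uname -r`, e.g. 3.10.0-1160.el7.x86_64
has_matching_kernel_devel() {
    running="$1"
    # -x: whole-line match, -F: fixed string, -q: quiet (exit status only)
    grep -qxF "kernel-devel-${running}"
}

# Usage on the host:
#   rpm -qa | has_matching_kernel_devel "$(uname -r)" \
#       || echo "no kernel-devel package for $(uname -r)"
```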
Run the installer again, accepting the default choice at every prompt
# ./NVIDIA-Linux-x86_64-535.183.06.run
# nvidia-smi
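Besides reading the nvidia-smi table by eye, the GPU count can be checked in a script. nvidia-smi -L prints one line per GPU, each starting with "GPU N:". The sketch below assumes that line format. count_gpus and check_gpu_count are hypothetical helper names, and the expected count of 8 comes from the hardware table.

```shell
# Hypothetical helper: count GPUs in `nvidia-smi -L` text on stdin.
count_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches; `|| true` hides that status.
    grep -c '^GPU [0-9]' || true
}

# Hypothetical helper: succeed only if the detected GPU count equals $1.
check_gpu_count() {
    expected="$1"
    actual="$(count_gpus)"
    [ "$actual" -eq "$expected" ]
}

# Usage on the host:
#   nvidia-smi -L | check_gpu_count 8 && echo "all 8 GPUs are visible"
```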
Install CUDA
Download CUDA from the following link
https://developer.nvidia.com/cuda-12-2-0-download-archive
Follow the instructions on that page to install
# wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
# sh cuda_12.2.0_535.54.03_linux.run
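After the installation completes, add the CUDA paths to the environment variables (quoting 'EOF' keeps ${PATH} and ${LD_LIBRARY_PATH} from being expanded while the file is written, so they are evaluated at login instead)
# cat << 'EOF' >> /root/.bashrc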
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
EOF
# source /root/.bashrc
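The ${VAR:+:${VAR}} form used in those export lines appends ":" plus the old value only when the variable is already set and non-empty. This avoids a trailing colon, which in LD_LIBRARY_PATH would be treated as an empty entry meaning the current directory. A small sketch of the same expansion follows. prepend_dir is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: print new_dir prepended to a colon-separated list.
# $1 = directory to prepend, $2 = current list (may be empty).
prepend_dir() {
    new_dir="$1"
    current="$2"
    # ${current:+:${current}} expands to ":<current>" only if current is non-empty.
    printf '%s\n' "${new_dir}${current:+:${current}}"
}

prepend_dir /usr/local/cuda/lib64 ""          # prints /usr/local/cuda/lib64
prepend_dir /usr/local/cuda/bin "/usr/bin"    # prints /usr/local/cuda/bin:/usr/bin
```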
Check the CUDA version