Hardware configuration
Model: Inspur CS5260H2
CPU: Hygon C86 5380 16-core Processor (x86 architecture)
GPU: Tesla T4
RAID controller: pm8204
Network card: Wangxun WX1860A2
The command output is as follows:
(base) [root@localhost ~]# lshw -c network
*-network:0
description: Ethernet interface
product: WX1860A2 Gigabit Ethernet Controller
vendor: Beijing Wangxun Technology Co., Ltd.
physical id: 0
bus info: pci@0000:01:00.0
logical name: em1
version: 01
serial: 9c:c2:c4:61:7a:df
size: 1Gbit/s
capacity: 1Gbit/s
width: 64 bits
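System installation
The basic OS installation steps are omitted here. Prepare a USB drive with Ventoy, download the ISO image, copy it onto the drive, then boot from the drive and run the installer.
System version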
(base) [root@localhost ~]# nkvers
############## Kylin Linux Version #################
Release:
Kylin Linux Advanced Server release V10 (Lance)
Kernel:
4.19.90-52.22.v2207.ky10.x86_64
Build:
Kylin Linux Advanced Server
release V10 (SP3) /(Lance)-x86_64-Build23/20230324
#################################################
(base) [root@localhost ~]#
Kernel version
(base) [root@localhost ~]# uname -a
Linux localhost.localdomain 4.19.90-52.22.v2207.ky10.x86_64 #1 SMP Tue Mar 14 12:19:10 CST 2023 x86_64 x86_64 x86_64 GNU/Linux
(base) [root@localhost ~]# rpm -qa | grep kernel
kernel-modules-extra-4.19.90-52.22.v2207.ky10.x86_64
kernel-tools-4.19.90-52.22.v2207.ky10.x86_64
kernel-core-4.19.90-52.22.v2207.ky10.x86_64
kernel-devel-4.19.90-52.22.v2207.ky10.x86_64
kernel-tools-libs-4.19.90-52.22.v2207.ky10.x86_64
kernel-headers-4.19.90-52.22.v2207.ky10.x86_64
kernel-4.19.90-52.22.v2207.ky10.x86_64
kernel-modules-4.19.90-52.22.v2207.ky10.x86_64
(base) [root@localhost ~]#
Disable nouveau and change the boot mode
Edit /lib/modprobe.d/dist-blacklist.conf
# Comment out the nvidiafb entry
#blacklist nvidiafb
# Add the following two lines
blacklist nouveau
options nouveau modeset=0
Rebuild the initramfs
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
Change the system boot mode
Check the current default target
systemctl get-default
Set it to command-line (multi-user) mode
systemctl set-default multi-user.target
Reboot
Run reboot
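After the reboot it can be worth confirming that nouveau is really no longer loaded before running the CUDA installer. The sketch below is one way to check, assuming the usual /proc/modules layout in which the module name is the first field of each line. The helper name is_module_loaded is made up for this example.

```python
from pathlib import Path


def is_module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    modules_file = Path("/proc/modules")  # only present on Linux
    if modules_file.exists():
        loaded = is_module_loaded("nouveau", modules_file.read_text())
        print("nouveau loaded:", loaded)
```

If this prints True, the blacklist entries or the rebuilt initramfs probably did not take effect and are worth rechecking.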
CUDA dependency configuration and installation
Install gcc; skip this step if the system already ships it
yum install gcc gcc-c++
Check the gcc version
(base) [root@localhost ~]# gcc --version
gcc (GCC) 7.3.0
Copyright © 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(base) [root@localhost ~]#
Check the kernel version
(base) [root@localhost ~]# ls /boot | grep vmlinu
vmlinuz-0-rescue-4f9a6ce3aaba4101b848f3a5814fe999
vmlinuz-4.19.90-52.22.v2207.ky10.x86_64
(base) [root@localhost ~]#
(base) [root@localhost ~]# rpm -aq | grep kernel-devel
kernel-devel-4.19.90-52.22.v2207.ky10.x86_64
(base) [root@localhost ~]# rpm -aq | grep kernel
kernel-modules-extra-4.19.90-52.22.v2207.ky10.x86_64
kernel-tools-4.19.90-52.22.v2207.ky10.x86_64
kernel-core-4.19.90-52.22.v2207.ky10.x86_64
kernel-devel-4.19.90-52.22.v2207.ky10.x86_64
kernel-tools-libs-4.19.90-52.22.v2207.ky10.x86_64
kernel-headers-4.19.90-52.22.v2207.ky10.x86_64
kernel-4.19.90-52.22.v2207.ky10.x86_64
kernel-modules-4.19.90-52.22.v2207.ky10.x86_64
(base) [root@localhost ~]#
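If a package is missing, install it with yum, or upgrade the kernel so that all versions match. A version mismatch between the running kernel and kernel-devel causes errors during the CUDA installation.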
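The version comparison can be automated. Below is a minimal sketch that takes the output of uname -r and the lines printed by rpm -qa | grep kernel, and reports which packages are missing for the running kernel. The function name and the default list of required packages (kernel-devel and kernel-headers) are assumptions made for this example.

```python
def missing_kernel_packages(running_kernel, installed_packages,
                            required=("kernel-devel", "kernel-headers")):
    """Return the required packages that are not installed for `running_kernel`.

    `running_kernel` is the output of `uname -r`; `installed_packages` is the
    list of lines printed by `rpm -qa | grep kernel`.
    """
    installed = set(installed_packages)
    return [f"{pkg}-{running_kernel}" for pkg in required
            if f"{pkg}-{running_kernel}" not in installed]


if __name__ == "__main__":
    # Values copied from the output shown above.
    kernel = "4.19.90-52.22.v2207.ky10.x86_64"
    packages = [
        "kernel-devel-4.19.90-52.22.v2207.ky10.x86_64",
        "kernel-headers-4.19.90-52.22.v2207.ky10.x86_64",
    ]
    print(missing_kernel_packages(kernel, packages))
```

An empty list means the build packages match the running kernel.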
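CUDA installation
For installing and configuring CUDA by following the official documentation, see my other blog post:
Installing and configuring CUDA 12.5 and cuDNN on Ubuntu 22.04.4, detailed official version (CSDN blog)
For Kylin I downloaded the binary package built for CentOS 7 and simply ran the installer.
Previous CUDA versions and their documentation are available at:
https://developer.nvidia.com/cuda-toolkit-archive
Run the following commands: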
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
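The runfile name above carries two version numbers, which are handy when checking compatibility later on. The sketch below pulls them out, assuming the file follows the cuda_<toolkit>_<driver>_linux.run pattern seen in this download. The helper name parse_cuda_runfile is hypothetical.

```python
import re

# Pattern inferred from the file name used in this post.
RUNFILE_RE = re.compile(
    r"^cuda_(?P<toolkit>\d+(?:\.\d+)+)_(?P<driver>\d+(?:\.\d+)+)_linux\.run$"
)


def parse_cuda_runfile(filename):
    """Split a CUDA runfile name into (toolkit_version, driver_version)."""
    match = RUNFILE_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected runfile name: {filename!r}")
    return match.group("toolkit"), match.group("driver")


if __name__ == "__main__":
    print(parse_cuda_runfile("cuda_12.1.0_530.30.02_linux.run"))
```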
Install Python 3.11
Install the dependencies
yum -y install epel-release wget make cmake gcc bzip2-devel libffi-devel zlib-devel
Install the full set of development tools:
sudo yum -y groupinstall "Development Tools"
Check the OpenSSL version
(base) [root@localhost ~]# openssl version
OpenSSL 3.0.15 3 Sep 2024 (Library: OpenSSL 3.0.15 3 Sep 2024)
(base) [root@localhost ~]#
Download the source
wget https://www.python.org/ftp/python/3.11.2/Python-3.11.2.tgz
Extract it
tar xvf Python-3.11.2.tgz
The directory listing is as follows
(base) [root@localhost ~]# cd Python-3.11.2/
(base) [root@localhost Python-3.11.2]# ls
aclocal.m4 config.guess config.sub Doc install-sh LICENSE Makefile.pre Modules PC Programs pyconfig.h.in python-config README.rst
_bootstrap_python config.log configure Grammar Lib Mac Makefile.pre.in Objects PCbuild pybuilddir.txt python python-config.py setup.py
build config.status configure.ac Include libpython3.11.a Makefile Misc Parser platform pyconfig.h Python python-gdb.py Tools
(base) [root@localhost Python-3.11.2]#
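Change into the source directory, then configure and build
LDFLAGS="${LDFLAGS} -Wl,-rpath=/usr/local/openssl/lib" ./configure --with-openssl=/usr/local/openssl && make
Install
make altinstall
Once installed, replace the symlinks
unlink /usr/bin/python
unlink /usr/bin/python3
ln -s /usr/local/bin/python3.11 /usr/bin/python
ln -s /usr/local/bin/python3.11 /usr/bin/python3
In the session below the conda base environment is active, so python resolves to Anaconda's interpreter and reports 3.12.7 rather than 3.11: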
(base) [root@localhost Python-3.11.2]# python
Python 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
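Since several interpreters can end up on the same machine, it helps to check which one actually runs and whether it was built with OpenSSL support. The following is a small sketch using only the standard library. The function name interpreter_report is made up for this example.

```python
import shutil
import sys


def interpreter_report():
    """Collect basic facts about the interpreter running this script."""
    try:
        import ssl  # import fails if Python was built without OpenSSL support
        openssl = ssl.OPENSSL_VERSION
    except ImportError:
        openssl = None
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_on_path": shutil.which("python"),
        "openssl": openssl,
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

Running it with /usr/local/bin/python3.11 and again with plain python shows whether the two commands point at the same interpreter.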
Install Anaconda
Download the installer script
wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
Other versions can be chosen from the archive page: https://repo.anaconda.com/archive/
Install
bash Anaconda3-2024.10-1-Linux-x86_64.sh
The output is as follows
Version 4.0 | Last Modified: March 31, 2024 | ANACONDA TOS
Do you accept the license terms? [yes|no]
>>> yes
Anaconda3 will now be installed into this location:
/root/anaconda3
- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below
[/root/anaconda3] >>>
PREFIX=/root/anaconda3
Unpacking payload ...
Installing base environment...
Downloading and Extracting Packages:
## Package Plan ##
added / updated specs:
- defaults/linux-64::_anaconda_depends==2024.10=py312_mkl_0[md5=c9ba3a4910c6668be6c04058513aca5d]
- defaults/linux-64::_libgcc_mutex==0.1=main[md5=c3473ff8bdb3d124ed5ff11ec380d6f9]
- defaults/linux-64::_openmp_mutex==5.1=1_gnu[md5=71d281e9c2192cb3fa425655a8defb85]
- defaults/linux-64::aiobotocore==2.12.3=py312h06a4308_0[md5=b54b2fa16e83039c4398d0b1d16c8cd9]
- defaults/linux-64::aiohappyeyeballs==2.4.0=py312h06a4308_0[md5=305f03cf0cb08197bafad774073c8880]
- defaults/linux-64::aiohttp==3.10.5=py312h5eee18b_0[md5=442b1f6b84684ca9bc6fdcd2d5ed7d40]
.....................................................................................
Downloading and Extracting Packages:
Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish to update your shell profile to automatically initialize conda?
This will activate conda on startup and change the command prompt when activated.
If you'd prefer that conda's base environment not be activated on startup,
run the following command when conda is activated:
conda config --set auto_activate_base false
You can undo this by running `conda init --reverse $SHELL`? [yes|no]
[no] >>> yes
no change /root/anaconda3/condabin/conda
no change /root/anaconda3/bin/conda
no change /root/anaconda3/bin/conda-env
no change /root/anaconda3/bin/activate
no change /root/anaconda3/bin/deactivate
no change /root/anaconda3/etc/profile.d/conda.sh
no change /root/anaconda3/etc/fish/conf.d/conda.fish
no change /root/anaconda3/shell/condabin/Conda.psm1
no change /root/anaconda3/shell/condabin/conda-hook.ps1
no change /root/anaconda3/lib/python3.12/site-packages/xontrib/conda.xsh
no change /root/anaconda3/etc/profile.d/conda.csh
modified /root/.bashrc
==> For changes to take effect, close and re-open your current shell. <==
Thank you for installing Anaconda3!
Installation complete
Run conda search torch to look up the available packages
(base) [root@localhost ~]# conda search torch
Loading channels: done
No match found for: torch. Search: *torch*
# Name Version Build Channel
_pytorch_select 0.1 cpu_0 pkgs/main
_pytorch_select 0.2 gpu_0 pkgs/main
diffusers-torch 0.11.0 py310h2f386ee_0 pkgs/main
diffusers-torch 0.11.0 py37hb070fc8_0 pkgs/main
diffusers-torch 0.11.0 py38hb070fc8_0 pkgs/main
diffusers-torch 0.11.0 py39hb070fc8_0 pkgs/main
diffusers-torch 0.18.2 py310h2f386ee_0 pkgs/main
diffusers-torch 0.18.2 py311h92b7b1e_0 pkgs/main
diffusers-torch 0.18.2 py312he106c6f_0 pkgs/main
diffusers-torch 0.18.2 py38h2f386ee_0 pkgs/main
diffusers-torch 0.18.2 py39h2f386ee_0 pkgs/main
diffusers-torch 0.30.3 py310h06a4308_0 pkgs/main
diffusers-torch 0.30.3 py311h06a4308_0 pkgs/main
diffusers-torch 0.30.3 py312h06a4308_0 pkgs/main
diffusers-torch 0.30.3 py38h06a4308_0 pkgs/main
diffusers-torch 0.30.3 py39h06a4308_0 pkgs/main
intel-extension-for-pytorch 1.12.1 py310h6a678d5_0 pkgs/main
intel-extension-for-pytorch 1.12.1 py38h6a678d5_0 pkgs/main
intel-extension-for-pytorch 1.12.1 py39h6a678d5_0 pkgs/main
pytorch 0.2.0 py27cuda7.5cudnn5.1_0 pkgs/main
pytorch 0.2.0 py27cuda7.5cudnn6.0_0 pkgs/main
pytorch 0.2.0 py27cuda8.0cudnn5.1_0 pkgs/main
pytorch 0.2.0 py27cuda8.0cudnn6.0_0 pkgs/main
pytorch 0.2.0 py35cuda7.5cudnn5.1_0 pkgs/main
pytorch 0.2.0 py35cuda7.5cudnn6.0_0 pkgs/main
pytorch 0.2.0 py35cuda8.0cudnn5.1_0 pkgs/main
pytorch 0.2.0 py35cuda8.0cudnn6.0_0 pkgs/main
pytorch 0.2.0 py36cuda7.5cudnn5.1_0 pkgs/main
pytorch 0.2.0 py36cuda7.5cudnn6.0_0 pkgs/main
pytorch 0.2.0 py36cuda8.0cudnn5.1_0 pkgs/main
pytorch 0.2.0 py36cuda8.0cudnn6.0_0 pkgs/main
pytorch 0.3.0 py27cuda7.5cudnn6.0_0 pkgs/main
pytorch 0.3.0 py27cuda8.0cudnn6.0_0 pkgs/main
pytorch 0.3.0 py27cuda8.0cudnn7.0_0 pkgs/main
pytorch 0.3.0 py35cuda7.5cudnn6.0_0 pkgs/main
pytorch 0.3.0 py35cuda8.0cudnn6.0_0 pkgs/main
Install PyTorch
Official reference
https://pytorch.ac.cn/get-started/previous-versions/
Linux and Windows
# CUDA 11.8
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia
# CUDA 12.1
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# CUDA 12.4
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
# CPU Only
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 cpuonly -c pytorch
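Which of these variants to pick depends on the CUDA version the installed driver supports (the "CUDA Version" shown by nvidia-smi). As a conservative rule of thumb, and only as a sketch, one can take the newest build that does not exceed that version. The function names and the default list of builds below are assumptions based on the options listed above.

```python
def parse_version(text):
    """Turn '12.1' into (12, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pick_pytorch_cuda(driver_cuda, builds=("11.8", "12.1", "12.4")):
    """Return the newest build whose CUDA version is <= driver_cuda, or None."""
    limit = parse_version(driver_cuda)
    usable = [build for build in builds if parse_version(build) <= limit]
    return max(usable, key=parse_version) if usable else None


if __name__ == "__main__":
    # Hypothetical input: a driver that reports CUDA 12.1.
    print(pick_pytorch_cuda("12.1"))
```

A result of None suggests falling back to the CPU-only build or upgrading the driver.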
Run the installation
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
The output is as follows
(base) [root@localhost ~]# conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
Channels:
- pytorch
- nvidia
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /root/anaconda3
added / updated specs:
- pytorch-cuda=12.1
- pytorch==2.4.1
- torchaudio==2.4.1
- torchvision==0.19.1
The following packages will be downloaded:
package | build
---------------------------|-----------------
cuda-cudart-12.1.105 | 0 189 KB nvidia
cuda-cupti-12.1.105 | 0 15.4 MB nvidia
cuda-libraries-12.1.0 | 0 2 KB nvidia
cuda-nvrtc-12.1.105 | 0 19.7 MB nvidia
cuda-nvtx-12.1.105 | 0 57 KB nvidia
cuda-opencl-12.6.77 | 0 25 KB nvidia
cuda-runtime-12.1.0 | 0 1 KB nvidia
cuda-version-12.6 | 3 16 KB nvidia
ffmpeg-4.3 | hf484d3e_0 9.9 MB pytorch
gmp-6.2.1 | h295c915_3 544 KB
gnutls-3.6.15 | he1e5248_0 1.0 MB
lame-3.100 | h7b6447c_0 323 KB
libcublas-12.1.0.26 | 0 329.0 MB nvidia
libcufft-11.0.2.4 | 0 102.9 MB nvidia
libcufile-1.11.1.6 | 0 895 KB nvidia
libcurand-10.3.7.77 | 0 39.7 MB nvidia
libcusolver-11.4.4.55 | 0 98.3 MB nvidia
libcusparse-12.0.2.55 | 0 163.0 MB nvidia
libidn2-2.3.4 | h5eee18b_0 146 KB
libjpeg-turbo-2.0.0 | h9bf148f_0 950 KB pytorch
libnpp-12.0.2.50 | 0 139.8 MB nvidia
libnvjitlink-12.1.105 | 0 16.9 MB nvidia
libnvjpeg-12.1.1.14 | 0 2.9 MB nvidia
libtasn1-4.19.0 | h5eee18b_0 63 KB
libunistring-0.9.10 | h27cfd23_0 536 KB
llvm-openmp-14.0.6 | h9e868ea_0 4.4 MB
nettle-3.7.3 | hbbd107a_1 809 KB
openh264-2.1.1 | h4ff587b_0 711 KB
pytorch-2.4.1 |py3.12_cuda12.1_cudnn9.1.0_0 1.35 GB pytorch
pytorch-cuda-12.1 | ha16c6d3_6 7 KB pytorch
pytorch-mutex-1.0 | cuda 3 KB pytorch
torchaudio-2.4.1 | py312_cu121 6.4 MB pytorch
torchtriton-3.0.0 | py312 233.5 MB pytorch
torchvision-0.19.1 | py312_cu121 8.5 MB pytorch
------------------------------------------------------------
Total: 2.52 GB
The following NEW packages will be INSTALLED:
cuda-cudart nvidia/linux-64::cuda-cudart-12.1.105-0
cuda-cupti nvidia/linux-64::cuda-cupti-12.1.105-0
cuda-libraries nvidia/linux-64::cuda-libraries-12.1.0-0
cuda-nvrtc nvidia/linux-64::cuda-nvrtc-12.1.105-0
cuda-nvtx nvidia/linux-64::cuda-nvtx-12.1.105-0
cuda-opencl nvidia/linux-64::cuda-opencl-12.6.77-0
cuda-runtime nvidia/linux-64::cuda-runtime-12.1.0-0
cuda-version nvidia/noarch::cuda-version-12.6-3
ffmpeg pytorch/linux-64::ffmpeg-4.3-hf484d3e_0
gmp pkgs/main/linux-64::gmp-6.2.1-h295c915_3
gnutls pkgs/main/linux-64::gnutls-3.6.15-he1e5248_0
lame pkgs/main/linux-64::lame-3.100-h7b6447c_0
libcublas nvidia/linux-64::libcublas-12.1.0.26-0
libcufft nvidia/linux-64::libcufft-11.0.2.4-0
libcufile nvidia/linux-64::libcufile-1.11.1.6-0
libcurand nvidia/linux-64::libcurand-10.3.7.77-0
libcusolver nvidia/linux-64::libcusolver-11.4.4.55-0
libcusparse nvidia/linux-64::libcusparse-12.0.2.55-0
libidn2 pkgs/main/linux-64::libidn2-2.3.4-h5eee18b_0
libjpeg-turbo pytorch/linux-64::libjpeg-turbo-2.0.0-h9bf148f_0
libnpp nvidia/linux-64::libnpp-12.0.2.50-0
libnvjitlink nvidia/linux-64::libnvjitlink-12.1.105-0
libnvjpeg nvidia/linux-64::libnvjpeg-12.1.1.14-0
libtasn1 pkgs/main/linux-64::libtasn1-4.19.0-h5eee18b_0
libunistring pkgs/main/linux-64::libunistring-0.9.10-h27cfd23_0
llvm-openmp pkgs/main/linux-64::llvm-openmp-14.0.6-h9e868ea_0
nettle pkgs/main/linux-64::nettle-3.7.3-hbbd107a_1
openh264 pkgs/main/linux-64::openh264-2.1.1-h4ff587b_0
pytorch pytorch/linux-64::pytorch-2.4.1-py3.12_cuda12.1_cudnn9.1.0_0
pytorch-cuda pytorch/linux-64::pytorch-cuda-12.1-ha16c6d3_6
pytorch-mutex pytorch/noarch::pytorch-mutex-1.0-cuda
torchaudio pytorch/linux-64::torchaudio-2.4.1-py312_cu121
torchtriton pytorch/linux-64::torchtriton-3.0.0-py312
torchvision pytorch/linux-64::torchvision-0.19.1-py312_cu121
Proceed ([y]/n)? y
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(base) [root@localhost ~]#
Installation complete; verification
Check the torch version
Check the CUDA version
Check that the CUDA and PyTorch versions match
(base) [root@localhost ~]# python
Python 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.4.1
>>> print(torch.cuda.is_available())
True
>>> print(torch.version.cuda)
12.1
>>>
The output is correct
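The same checks can be collected into a small script so they are easy to rerun after driver or package changes. This is a minimal sketch. The function name describe_torch is made up for this example, and it returns None when torch is not importable.

```python
def describe_torch():
    """Return a dict describing the PyTorch/CUDA setup, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device": None,
    }
    if info["cuda_available"]:
        # Name of the first visible GPU.
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_torch())
```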
Run conda list to view the torch package information
Run conda list | grep torch
(base) [root@localhost ~]# conda list | grep torch
ffmpeg 4.3 hf484d3e_0 pytorch
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
pytorch 2.4.1 py3.12_cuda12.1_cudnn9.1.0_0 pytorch
pytorch-cuda 12.1 ha16c6d3_6 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 2.4.1 py312_cu121 pytorch
torchtriton 3.0.0 py312 pytorch
torchvision 0.19.1 py312_cu121 pytorch
(base) [root@localhost ~]#
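The build strings in this listing also encode the CUDA version each package was built against, so they can be cross-checked. The sketch below reads tags of the form cuda12.1 and cu121, which is the convention visible in the output above. The helper name cuda_tag is hypothetical.

```python
import re


def cuda_tag(build):
    """Extract a CUDA version such as '12.1' from a conda build string, or None."""
    match = re.search(r"cuda(\d+)\.(\d+)", build)  # e.g. py3.12_cuda12.1_cudnn9.1.0_0
    if match is None:
        match = re.search(r"cu(\d{2})(\d+)", build)  # e.g. py312_cu121
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


if __name__ == "__main__":
    # Build strings copied from the conda list output above.
    builds = {
        "pytorch": "py3.12_cuda12.1_cudnn9.1.0_0",
        "torchaudio": "py312_cu121",
        "torchvision": "py312_cu121",
    }
    tags = {name: cuda_tag(build) for name, build in builds.items()}
    print(tags, "consistent:", len(set(tags.values())) == 1)
```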
At this point CUDA and PyTorch are fully configured on Kylin, and the GPU can be called successfully from Python.