Querying Linux server information and installing the GPU driver and CUDA (automatic and manual methods explained)

Latest recommended article published 2024-09-07 17:21:42

爆肝疯学大模型


Tags: servers, Linux, operations, GPU computing

Article link: https://blog.csdn.net/weixin_41973200/article/details/141715016


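While working on Google's cloud I was given a Linux server running Debian 12 with two A100 GPUs. Nothing was configured on it, so CUDA had to be set up from scratch. That involves querying the server's own hardware information and installing CUDA; several ways of doing both are described in detail here.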

Querying hardware information

I. Checking the server vendor

1. Querying the model

[root@DevopsManager ~]# dmidecode | grep "Product"
	Product Name: Alibaba Cloud ECS

2. Checking the GPU driver

nvidia-smi
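
For scripted checks, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The helper below is a minimal sketch, not part of the original setup: summarize_gpus is a hypothetical name, and it assumes every GPU in the machine is the same model.

```shell
# Summarize the output of
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# read from stdin (one "name, driver_version" line per GPU).
# Assumption: all GPUs are the same model, so the last name seen is reported.
summarize_gpus() {
    awk -F', ' '
        NF >= 2 { n++; name = $1; driver = $2 }
        END {
            if (n) { printf "%d x %s (driver %s)\n", n, name, driver }
            else   { print "no GPU reported" }
        }'
}

# Typical use on a machine with the driver installed:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | summarize_gpus
```

If nvidia-smi itself is missing or fails, the driver is not installed yet and the installation steps later in this post apply.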


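3. Querying the graphics card model

# Data-center GPUs such as the A100 are listed as "3D controller" rather than "VGA", so match both
lspci | grep -i -E 'vga|3d controller'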

4. Querying the Linux distribution version

lsb_release -a
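
On a minimal installation lsb_release may not be present. The same information can usually be read from /etc/os-release instead. The function below is a small sketch under that assumption; os_version is a hypothetical helper name.

```shell
# Print "<ID> <VERSION_ID>" from an os-release style file.
# Defaults to /etc/os-release; pass another path (containing a slash) to override.
os_version() {
    file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$file" && printf '%s %s\n' "$ID" "$VERSION_ID" )
}

# Typical use:
#   os_version
```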


II. Viewing CPU information

[root@DevopsManager ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
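CPU(s):                8				### 8 logical CPUs in total (sockets x cores per socket x threads per core)
On-line CPU(s) list:   0-7
Thread(s) per core:    2				### each physical core runs 2 hardware threads
Core(s) per socket:    4				
Socket(s):             1				### one physical CPU (socket)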
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni
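
The CPU(s) value is the product of three other fields: sockets, cores per socket, and threads per core. As a quick cross-check, the sketch below recomputes it from lscpu output; cpu_topology is a hypothetical helper name, and it assumes the English field labels shown above.

```shell
# Read `lscpu` output on stdin and print how the logical CPU count is derived.
cpu_topology() {
    awk -F: '
        /^Socket\(s\)/          { s = $2 + 0 }
        /^Core\(s\) per socket/ { c = $2 + 0 }
        /^Thread\(s\) per core/ { t = $2 + 0 }
        END {
            printf "%d socket(s) x %d core(s) x %d thread(s) = %d logical CPUs\n", s, c, t, s * c * t
        }'
}

# Typical use:
#   lscpu | cpu_topology
```

For the machine above this gives 1 x 4 x 2 = 8, which matches the CPU(s) line.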

III. Listing the system's PCI devices

[root@DevopsManager ~]# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Communication controller: Red Hat, Inc. Virtio console
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:1f.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

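Installing the GPU driver and the CUDA toolkit

I. Google's automated installation

Reference: Google's guide to installing GPU drivers automatically
This approach is fully automated and very convenient, and it is the one I recommend most: there is no need to look up and match versions yourself, because the script handles everything.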

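# Download the installer script
curl -L https://github.com/GoogleCloudPlatform/compute-gpu-installation/releases/download/cuda-installer-v1.1.0/cuda_installer.pyz --output cuda_installer.pyz
# Run the installer. This takes a while and may restart the VM. If the VM restarts, run the script again to continue the installation.
sudo python3 cuda_installer.pyz install_driver
# Verify that the GPU driver was installed successfully
nvidia-smi

# Use the same tool to install the CUDA toolkit
sudo python3 cuda_installer.pyz install_cuda
# This step can take at least 30 minutes and may restart the VM. If the VM restarts, run the script again to continue the installation.

# Verify that the toolkit was installed successfully
sudo python3 cuda_installer.pyz verify_cuda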

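II. Manual installation

References:
1. Ubuntu GPU driver, CUDA toolkit, and PyTorch installation tutorial

1. Installing the GPU driver

Choose a driver that matches your GPU model.

Download it here: https://www.nvidia.com/Download/index.aspx
Run the driver installer:
sh <path to the downloaded .run file>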

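2. Installing the CUDA toolkit

References:
1. https://blog.51cto.com/u_16213695/10406244 (installation tutorial 1)
2. CUDA toolkit version compatibility table
3. https://juejin.cn/post/7314152331218452531 (installation tutorial 2)
4. Downloads for each CUDA toolkit version
5. https://zhuanlan.zhihu.com/p/701577195 (installation tutorial 3)

(0) Check whether CUDA is already installed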

nvcc -V    # or: nvcc --version
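
A "command not found" result here does not always mean the toolkit is absent; nvcc may simply not be on PATH yet (see the environment variable step below). When nvcc does run, its output contains a line of the form "Cuda compilation tools, release X.Y, ...". The sketch below extracts that release number; cuda_release and check_nvcc are hypothetical helper names.

```shell
# Extract the "X.Y" release number from `nvcc --version` output on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Print the installed CUDA release, or fail if nvcc is not on PATH.
check_nvcc() {
    if command -v nvcc >/dev/null 2>&1; then
        nvcc --version | cuda_release
    else
        echo "nvcc not found on PATH" >&2
        return 1
    fi
}

# Typical use:
#   check_nvcc || echo "CUDA toolkit missing or PATH not configured"
```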

(1) Go to https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64 and choose the version that suits your server; an online (network) installation is available.
(2) Use the online (network) installer

(3) Install the toolkit

(4) Update the environment variables
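
The exact lines used in the original screenshot are not recoverable, so the following is only a sketch of the usual setup. It assumes the toolkit lives under /usr/local/cuda, the symlink NVIDIA's installers normally create; adjust CUDA_HOME if your install path differs.

```shell
# Assumption: the toolkit is installed under /usr/local/cuda.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
export CUDA_HOME

# Put nvcc and the other CUDA tools on PATH.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"

# Let the dynamic linker find the CUDA runtime libraries.
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Appending these lines to ~/.bashrc and running source ~/.bashrc makes them persistent for the current user; nvcc -V should then succeed.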


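Troubleshooting

1. Restoring the APT sources

References:
1. Changing the APT sources in a Debian 12 container
2. Removing and restoring Ubuntu's built-in packages, using apt as the example
3. Tsinghua University open source software mirror site

While installing Docker I changed the APT sources, which broke APT with no obvious way back; even apt-get update would no longer run.
Solution:
1. First, be aware that the /etc/apt/sources.list file that used to be edited is no longer there. The default source configuration now lives in /etc/apt/sources.list.d/debian.sources.
2. Open that default sources file and fix the entries. Restore the original sources if you can; if not, switching to a mirror in China works too.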
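
To make step 2 concrete, here is a sketch that regenerates a debian.sources file in the deb822 format Debian 12 uses. The contents are an assumption modelled on a stock bookworm setup, not a copy of the file Google's image ships, and write_debian_sources is a hypothetical helper. The optional second argument is the mirror base URL, for example https://mirrors.tuna.tsinghua.edu.cn for the Tsinghua mirror.

```shell
# Write a bookworm deb822 sources file to the given path.
#   $1: target path
#   $2: mirror base URL (default: http://deb.debian.org)
write_debian_sources() {
    target="$1"
    base="${2:-http://deb.debian.org}"
    cat > "$target" <<EOF
Types: deb
URIs: $base/debian
Suites: bookworm bookworm-updates
Components: main
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

Types: deb
URIs: $base/debian-security
Suites: bookworm-security
Components: main
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
EOF
}

# Typical use: write to a scratch file first, review it, then install it.
#   write_debian_sources /tmp/debian.sources
#   sudo cp /tmp/debian.sources /etc/apt/sources.list.d/debian.sources
#   sudo apt-get update
```

Writing to a scratch path first keeps the broken file in place until the new one has been reviewed. Extra components such as contrib or non-free can be added to the Components lines if needed.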

2. Error when running Docker with GPUs

Starting a container with GPU access fails with:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

References:

https://www.cnblogs.com/huakai201/p/18132554 (did not solve my problem)
https://stackoverflow.com/questions/75118992/docker-error-response-from-daemon-could-not-select-device-driver-with-capab (solved the problem)
https://www.cnblogs.com/booturbo/p/16318627.html (did not solve my problem)
https://zhuanlan.zhihu.com/p/688894010 (did not solve my problem)

# 1.Configure the repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey |sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& sudo apt-get update
# 2.Install the NVIDIA Container Toolkit packages:
sudo apt-get install -y nvidia-container-toolkit
# 3.Configure the container runtime by using the nvidia-ctk command:
sudo nvidia-ctk runtime configure --runtime=docker
# 4.Restart the Docker daemon:
sudo systemctl restart docker
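
After the restart it is worth confirming that Docker actually picked up the NVIDIA runtime. The sketch below checks the "Runtimes:" line of docker info; has_nvidia_runtime is a hypothetical helper, and the assumption is that the configured runtime is listed there under the name nvidia.

```shell
# Succeed if the `docker info` output on stdin lists an "nvidia" runtime.
has_nvidia_runtime() {
    grep -i '^ *Runtimes:' | grep -q -w nvidia
}

# Typical use:
#   sudo docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
# End-to-end check that a container can see the GPUs:
#   sudo docker run --rm --gpus all ubuntu nvidia-smi
```

If the end-to-end check prints the same GPU table as nvidia-smi on the host, the "could not select device driver" error is resolved.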
