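Background
Architecture planning for the training nodes. This note mainly records the IB (InfiniBand) NIC driver setup and the IB switch-side configuration.
Implementation
Client-side IB NIC configuration
OS version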
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
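The OFED tarball name used later in this note embeds a distro tag (ubuntu22.04). As a rough sketch, that tag could be derived from the os-release fields shown above. The function name ofed_distro_tag is hypothetical, and the rule "ID followed by VERSION_ID" is an assumption inferred from this one example, not a documented naming scheme.

```shell
# Print a distro tag such as "ubuntu22.04" from an os-release style file.
# Assumption: the tag is simply ID followed by VERSION_ID.
ofed_distro_tag() {
    # $1: path to the os-release file (defaults to /etc/os-release)
    file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak out.
    (
        . "$file"
        printf '%s%s\n' "$ID" "$VERSION_ID"
    )
}

# Typical use on a real host:
#   ofed_distro_tag
```

On the system above this should print ubuntu22.04, which matches the tarball downloaded in the next step.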
Kernel version (uname -a)
Linux hostname 5.15.0-118-generic #128-Ubuntu SMP Fri Jul 5 09:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Hardware information
lspci |grep -i mellanox
83:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
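When several nodes need the same check, the lspci filter above can be wrapped so it returns only the PCI addresses. This is a minimal sketch; the function name mlx_ib_pci_addrs is hypothetical, and it assumes the device line contains both "Mellanox" and "Infiniband controller", as in the output above.

```shell
# Print the PCI address of every Mellanox InfiniBand controller found in
# lspci-style text read from stdin.
mlx_ib_pci_addrs() {
    awk 'tolower($0) ~ /mellanox/ && tolower($0) ~ /infiniband controller/ {
        print $1   # first field is the PCI address, e.g. 83:00.0
    }'
}

# Typical use on a real host:
#   lspci | mlx_ib_pci_addrs
```

An empty result would suggest the card is not visible on the PCI bus, which is worth ruling out before installing the driver.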
Download the IB NIC driver
curl -x socks5://xxx:xx -LO https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.7.0.0/MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64.tgz
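The download URL repeats the version string and encodes the distro and architecture. A small helper like the one below could build it from those parts. The function name ofed_url is hypothetical, and the URL layout is assumed from the single URL above; other releases may be published under a different path.

```shell
# Build the MLNX_OFED download URL from its parts.
# Assumption: the layout matches the one URL used in this note.
ofed_url() {
    ver="$1"      # e.g. 24.04-0.7.0.0
    distro="$2"   # e.g. ubuntu22.04
    arch="$3"     # e.g. x86_64
    printf 'https://content.mellanox.com/ofed/MLNX_OFED-%s/MLNX_OFED_LINUX-%s-%s-%s.tgz\n' \
        "$ver" "$ver" "$distro" "$arch"
}

# Typical use (add the -x proxy option if one is needed):
#   curl -LO "$(ofed_url 24.04-0.7.0.0 ubuntu22.04 x86_64)"
```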
Unpack
tar zxf MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64/
Install
./mlnxofedinstall
Logs dir: /tmp/MLNX_OFED_LINUX.17643.logs
General log file: /tmp/MLNX_OFED_LINUX.17643.logs/general.log
Below is the list of MLNX_OFED_LINUX packages that you have chosen
(some may have been added by the installer due to package dependencies):
ofed-scripts
mlnx-tools
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-dkms
iser-dkms
isert-dkms
srp-dkms
rdma-core
libibverbs1
ibverbs-utils
ibverbs-providers
libibverbs-dev
libibverbs1-dbg
libibumad3
libibumad-dev
ibacm
librdmacm1
rdmacm-utils
librdmacm-dev
mstflint
ibdump
libibmad5
libibmad-dev
libopensm
opensm
opensm-doc
libopensm-devel
libibnetdisc5
infiniband-diags
mft
kernel-mft-dkms
perftest
ibutils2
ibsim
ibsim-doc
ucx
sharp
hcoll
knem-dkms
knem
openmpi
mpitests
dpcp
srptools
mlnx-ethtool
mlnx-iproute2
rshim
ibarr
This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.
Do you want to continue?[y/N]:y
Checking SW Requirements...
One or more required packages for installing MLNX_OFED_LINUX are missing.
Attempting to install the following missing packages:
swig quilt libltdl-dev autotools-dev automake flex libfuse2 debhelper libnl-route-3-200 libgfortran5 autoconf libnl-3-dev chrpath pkg-config bison tk m4 libnl-route-3-dev graphviz gfortran dkms
Removing old packages...
Installing new packages
Installing ofed-scripts-24.04.OFED.24.04.0.7.0...
Installing mlnx-tools-24.04.0.2404066...
Installing mlnx-ofed-kernel-utils-24.04.OFED.24.04.0.7.0.1...
Installing mlnx-ofed-kernel-dkms-24.04.OFED.24.04.0.7.0.1...
Installing iser-dkms-24.04.OFED.24.04.0.7.0.1...
Installing isert-dkms-24.04.OFED.24.04.0.7.0.1...
Installing srp-dkms-24.04.OFED.24.04.0.7.0.1...
Installing rdma-core-2404mlnx51...
Installing libibverbs1-2404mlnx51...
Installing ibverbs-utils-2404mlnx51...
Installing ibverbs-providers-2404mlnx51...
Installing libibverbs-dev-2404mlnx51...
Installing libibverbs1-dbg-2404mlnx51...
Installing libibumad3-2404mlnx51...
Installing libibumad-dev-2404mlnx51...
Installing ibacm-2404mlnx51...
Installing librdmacm1-2404mlnx51...
Installing rdmacm-utils-2404mlnx51...
Installing librdmacm-dev-2404mlnx51...
Installing mstflint-4.16.1...
Installing ibdump-6.0.0...
Installing libibmad5-2404mlnx51...
Installing libibmad-dev-2404mlnx51...
Installing libopensm-5.19.0.MLNX20240421.b7c161a9...
Installing opensm-5.19.0.MLNX20240421.b7c161a9...
Installing opensm-doc-5.19.0.MLNX20240421.b7c161a9...
Installing libopensm-devel-5.19.0.MLNX20240421.b7c161a9...
Installing libibnetdisc5-2404mlnx51...
Installing infiniband-diags-2404mlnx51...
Installing mft-4.28.0...
Installing kernel-mft-dkms-4.28.0.92...
Installing perftest-24.04.0...
Installing ibutils2-2.1.1...
Installing ibsim-0.12...
Installing ibsim-doc-0.12...
Installing ucx-1.17.0...
Installing sharp-3.7.0.MLNX20240421.48444036...
Installing hcoll-4.8.3227...
Installing knem-dkms-1.1.4.90mlnx3...
Installing knem-1.1.4.90mlnx3...
Installing openmpi-4.1.7a1...
Installing mpitests-3.2.23...
Installing dpcp-1.1.48...
Installing srptools-2404mlnx51...
Installing mlnx-ethtool-6.7...
Installing mlnx-iproute2-6.7.0...
Installing rshim-2.0.28...
Installing ibarr-0.1.3...
Selecting previously unselected package mlnx-fw-updater.
(Reading database ... 126356 files and directories currently installed.)
Preparing to unpack .../mlnx-fw-updater_24.04-0.7.0.0_amd64.deb ...
Unpacking mlnx-fw-updater (24.04-0.7.0.0) ...
Setting up mlnx-fw-updater (24.04-0.7.0.0) ...
Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf
Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_00003
PCI Device Name: 83:00.0
Base GUID: 58a2e652
Versions: Current Available
FW 20.39.1002 20.41.1000
PXE 3.7.0201 3.7.0400
UEFI 14.32.0012 14.34.0012
Status: Update required
---------
Found 1 device(s) requiring firmware update...
Device #1: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK Done
Restart needed for updates to take effect.
Log File: /tmp/s59xQUEIEL
Real log file: /tmp/MLNX_OFED_LINUX.17643.logs/fw_update.log
Device (83:00.0):
83:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Link Width: x16
PCI Link Speed: 16GT/s
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
root@host:~/MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64# /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading HCA driver and Access Layer: [ OK ]
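The installer output above reports a firmware update (20.39.1002 to 20.41.1000) followed by "Restart needed for updates to take effect." If the output is saved to a file, a check along these lines could flag nodes that still need a restart. Both function names are hypothetical, and the parsing assumes the exact wording and table layout seen in the log above.

```shell
# Exit 0 if installer output on stdin says a restart is still needed for
# a firmware update, non-zero otherwise.
fw_restart_pending() {
    grep -q 'Restart needed for updates to take effect'
}

# Print "<current> <available>" from the FW row of the version table.
fw_versions() {
    awk '$1 == "FW" && NF >= 3 { print $2, $3; exit }'
}

# Typical use, assuming the installer output was saved to install.out:
#   fw_versions < install.out
#   fw_restart_pending < install.out && echo "restart required"
```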
Test
root@hostname:~# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-24.04-0.7.0.0 (OFED-24.04-0.7.0): 5.15.0-118-generic
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v20.41.1000
20.39.1002
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (HCA)..... UP 4X HDR (InfiniBand)
Error Counter Check on CA #0 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... 58:a2:d6:52
------------------ DONE ---------------------
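For unattended runs, the self-test report can be reduced to a pass/fail exit status. The sketch below treats every line containing "Check" or "Initialization" as a check and requires its last field to be PASS. The function name is hypothetical, and the line selection is an assumption based only on the report shown above.

```shell
# Print every check line whose result is not PASS. Exit 0 only if at
# least one check line was seen and all of them ended in PASS.
hca_self_test_failures() {
    awk '
        /(Check|Initialization).*\.\.\./ {
            checks++
            if ($NF != "PASS") { print; failed++ }
        }
        END { exit (checks > 0 && failed == 0) ? 0 : 1 }
    '
}

# Typical use on a real host:
#   hca_self_test.ofed | hca_self_test_failures || echo "self test needs review"
```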
Verification
root@host:~# ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x58a2e103052
System image GUID: 0x58a2e10300a9d652
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 9
LMC: 0
SM lid: 1
Capability mask: 0xa6518
Port GUID: 0x58a2e0a9d652
Link layer: InfiniBand
root@host:~# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:58a2:00a9:d652
base lid: 0x9
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand
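The three fields that matter in the ibstat output above are State, Physical state, and Rate. A minimal sketch of an automated check follows. The function name ib_port_ok is hypothetical, and it assumes "Link layer" is the last field of each port block, as in the output above.

```shell
# Exit 0 if ibstat-style text on stdin shows at least one port that is
# Active, LinkUp, and running at the expected rate in Gb/s.
ib_port_ok() {
    want_rate="$1"   # e.g. 200 for 4X HDR
    awk -v want="$want_rate" '
        $1 == "State:"                     { state = $2 }
        $1 == "Physical" && $2 == "state:" { phys = $3 }
        $1 == "Rate:"                      { rate = $2 }
        $1 == "Link" && $2 == "layer:" {
            # End of one port block: evaluate it, then reset the fields.
            if (state == "Active" && phys == "LinkUp" && rate == want) ok = 1
            state = phys = rate = ""
        }
        END { exit ok ? 0 : 1 }
    '
}

# Typical use on a real host:
#   ibstat | ib_port_ok 200 || echo "IB port is not up at 200 Gb/s"
```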
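Configuring the IP address works the same way as for an ordinary NIC.
Note that only the IP address and netmask are needed; no gateway has to be configured on the IB interface.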
cat /etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    ens1f0:
      dhcp4: false
      addresses: [xxx]
      optional: true
      routes:
        - to: default
          via: xxx
      nameservers:
        addresses:
          - xxx
    ibs108: # IB NIC configuration
      dhcp4: false
      addresses: [xxx]
      nameservers:
        addresses:
          - xxx
    ens1f1:
      dhcp4: true
    ens5f0:
      dhcp4: false
    ens5f1:
      dhcp4: false
    ens5f2:
      dhcp4: false
    ens5f3:
      dhcp4: false
    enxda5cac3f11ab:
      dhcp4: true
  version: 2
Then run netplan apply.
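As a sketch of the "address and netmask only" rule, the helper below renders a netplan stanza for the IB interface with no routes section. The function name render_ib_netplan and the file name 60-ib.yaml are hypothetical. The interface name ibs108 comes from the config above, and 192.0.2.10/24 is a documentation placeholder rather than a real address.

```shell
# Render a netplan stanza for one IB interface with a static address and
# no gateway. Interface name and address are supplied by the caller.
render_ib_netplan() {
    iface="$1"   # e.g. ibs108
    cidr="$2"    # e.g. 192.0.2.10/24 (placeholder address)
    cat <<EOF
network:
  version: 2
  ethernets:
    $iface:
      dhcp4: false
      addresses: [$cidr]
EOF
}

# Typical use (file name is hypothetical):
#   render_ib_netplan ibs108 192.0.2.10/24 > /etc/netplan/60-ib.yaml
#   netplan apply
```

Keeping the IB stanza in its own file is only one option; the single-file layout shown above works as well.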
A tuning script can also be set up for the IB NIC,
for example to increase the MTU.
Test parameters will be added later.
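As a rough illustration of the MTU tuning mentioned above, and not the tested parameters, which are still to be added: IPoIB in datagram mode carries a 4-byte encapsulation header, so the usable IP MTU is the IB MTU minus 4 (2044 for a 2048-byte IB MTU, 4092 for 4096). Whether 4096 is available depends on fabric and subnet manager settings that this note does not cover. Both function names are hypothetical, and the second one only prints the command instead of running it.

```shell
# IPoIB datagram-mode MTU: IB link MTU minus the 4-byte IPoIB header.
ipoib_datagram_mtu() {
    echo $(( $1 - 4 ))
}

# Print (do not run) the command that would set the MTU on an interface.
mtu_cmd() {
    printf 'ip link set dev %s mtu %s\n' "$1" "$2"
}

# Typical use, assuming the fabric allows a 4096-byte IB MTU:
#   mtu_cmd ibs108 "$(ipoib_datagram_mtu 4096)"
```

A change made with ip link does not persist across reboots, so a persistent value would still need to go into the network configuration.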
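IB switch configuration
When connecting to the IB switch with a serial console cable, set the baud rate to 115200. This depends on the model; some models use 9600. Nothing else needs special attention.
Otherwise it is much like an ordinary switch. Make sure the subnet manager is enabled and running.
Configure an IP address on the management port.
Configure the SSH login password.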
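The subnet manager can also be checked from a host on the fabric with sminfo, which ships in the infiniband-diags package installed earlier. The sketch below assumes sminfo prints one line containing "sm lid <n>" and a state name such as SMINFO_MASTER; that format is an assumption here, not something captured in this note. Both function names are hypothetical.

```shell
# Exit 0 if sminfo-style text on stdin reports a master subnet manager.
# Assumption: the state is printed as the token SMINFO_MASTER.
sm_is_master() {
    grep -q 'SMINFO_MASTER'
}

# Print the LID that follows the words "sm lid" in the same text.
sm_lid() {
    awk '{
        for (i = 1; i < NF - 1; i++)
            if ($i == "sm" && $(i + 1) == "lid") { print $(i + 2); exit }
    }'
}

# Typical use on a host with an active IB port:
#   sminfo | sm_is_master || echo "no master subnet manager found"
```

The LID reported this way should agree with the "SM lid" value that ibstat shows on the host.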
<hostc>ssh <switch IP address>
Username: admin
Press CTRL+C to abort.
Connecting to <switch IP address> port 22.
Mellanox MLNX-OS Switch Management
Password:
Enter a character ~ and a dot to abort.
Last login: Wed Aug 21 13:00:24 UTC 2024 from xxx on pts/0
Number of total successful connections since last 1 days: 15
Mellanox Switch
switch-4a128e [standalone: master] >
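Enter privileged mode: enable
Enter configuration mode: conf t
View the configuration: show running-config
Check the port status, including whether the IB protocol is enabled: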
switch-4a128e [standalone: master] (config) # show interfaces ib status
-------------------------------------------------------------------------------------------------------------
Interface  Description  IB Subnet           Speed  Current line rate  Logical port state  Physical port state
-------------------------------------------------------------------------------------------------------------
IB1/1                   infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/2                   infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/3                   infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/4                   infiniband-default  -      -                  Down                Polling
IB1/5                   infiniband-default  -      -                  Down                Polling
IB1/6                   infiniband-default  -      -                  Down                Polling
IB1/7                   infiniband-default  -      -                  Down                Polling
IB1/8                   infiniband-default  -      -                  Down                Polling
IB1/9                   infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/10                  infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/11                  infiniband-default  -      -                  Down                Polling
IB1/12                  infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/13                  infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/14                  infiniband-default  -      -                  Down                Polling
IB1/15                  infiniband-default  -      -                  Down                Polling
IB1/16                  infiniband-default  -      -                  Down                Polling
IB1/17                  infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/18                  infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/19                  infiniband-default  hdr    200.0 Gbps         Active              LinkUp
IB1/20                  infiniband-default  -      -                  Down                Polling
(remaining output omitted)
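With many ports, the status table is easier to read as a summary. The sketch below counts ports whose last two columns are Active and LinkUp and lists the rest. The function name is hypothetical, and it assumes the column layout shown above, with port names of the form IB<n>/<m> in the first column.

```shell
# Summarise "show interfaces ib status" text read from stdin.
ib_switch_port_summary() {
    awk '
        $1 ~ /^IB[0-9]+\/[0-9]+$/ {
            total++
            # Logical and physical port state are the last two columns.
            if ($(NF - 1) == "Active" && $NF == "LinkUp") up++
            else down = down " " $1
        }
        END {
            printf "active=%d total=%d\n", up, total
            if (down != "") printf "not_active:%s\n", down
        }
    '
}

# Typical use, assuming the switch output was saved to ports.txt:
#   ib_switch_port_summary < ports.txt
```

Comparing the not_active list against the cabling plan helps separate unused ports from ports that should be up.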
Save the configuration
switch-4a128e [standalone: master] # write memory
View the serial number (SN)
switch-4a128e [standalone: master] # show inventory
References
IB NIC configuration
https://blog.csdn.net/laijianzong/article/details/127545152
Introduction to Mellanox Technologies Ltd
https://36kr.com/p/2368067301091716
IB switch configuration reference
https://www.hua-hang.cn/case/269.html
Reference for learning the IB architecture
https://blog.csdn.net/sz_woshishazi/category_12032159.html