Installing and debugging the driver for an InfiniBand controller (Mellanox Technologies MT27800 Family [ConnectX-5]) on Ubuntu 20.04 (Linux 5.4.0-81-generic)
Reference 1: Home 10-gigabit networking guide 1 - why not start with the simplest 100G network - Zhihu (zhihu.com)
Reference 2: Installing and debugging an InfiniBand NIC (MT27500 Family [ConnectX-3]) driver on CentOS 7.9 (kernel 3.10.0-1160.83.1.el7.x86_64) - Zhihu (zhihu.com)
Reference 3: FAQ - Common IB commands - Huawei (huawei.com)
OS version and kernel version
root@vnet:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
root@vnet:~# uname -rs
Linux 5.4.0-81-generic
Insert the NIC, reboot, and check the InfiniBand controller:
lspci |grep Infiniband
ca:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
ca:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Download the driver from the official site: NVIDIA InfiniBand Software | NVIDIA
MLNX_OFED v5.9-0.5.6.0
Note: By downloading and installing MLNX_OFED package for Oracle Linux (OL) OS, you may be violating your operating system’s support matrix. Please consult with your operating system support before installing.
Note: MLNX_OFED 4.9-x LTS should be used by customers who would like to utilize one of the following:
- NVIDIA ConnectX-3 Pro
- NVIDIA ConnectX-3
- NVIDIA Connect-IB
- RDMA experimental verbs library (mlnx_lib)
- OSs based on kernel version lower than 3.10
Note: All of the above are not available on MLNX_OFED 5.x branch.
Note: MLNX_OFED 5.4/5.8-x LTS should be used by customers who would like to utilize NVIDIA ConnectX-4 onwards adapter cards and keep using stable 5.4/5.8-x deployment and get:
- Critical bug fixes
- Support for new major OSs
Download it from the web page, then copy it to the Linux host and extract it.
Verify the SHA256 checksum:
root@vnet:~# sha256sum MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
99aa2966ce260f3ca282e24a26c6f5302692f9072117626107aa599868208d8f MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
Extract:
tar -zxf MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
Install:
cd MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64/
./mlnxofedinstall
(Part of the installation log is omitted here.)
Choose y.
A prompt in red means the installer is automatically trying to install some missing dependencies; just wait for it to finish.
When the installation finishes, it prompts you to run /etc/init.d/openibd restart to load the new driver.
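If the running kernel is not one the package was prebuilt for, the installer can rebuild the modules itself. The flags below are my assumption based on the MLNX_OFED installer's documented options; confirm with ./mlnxofedinstall --help:
# rebuild the kernel modules against the running kernel
./mlnxofedinstall --add-kernel-support
# optionally skip the firmware update step
./mlnxofedinstall --without-fw-update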
Start openibd
openibd is the daemon the NIC needs; it also loads the required kernel modules. We start it and enable it at boot (see the sketch after the status check below).
root@vnet:~/MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64# /etc/init.d/openibd restart
Unloading HCA driver:                              [  OK  ]
Loading HCA driver and Access Layer:               [  OK  ]
Check the status:
systemctl status openibd
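To also start openibd on boot, as mentioned above, a minimal sketch assuming systemd picks up the init script installed by MLNX_OFED:
# enable openibd at boot
systemctl enable openibd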
Use the ibv_devinfo command to view device information:
[root@172-0-1-167 ~]# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.35.2000
node_guid: 1070:fd03:0079:cf64
sys_image_guid: 1070:fd03:0079:cf64
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000008
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
Or check the IB state with ibstat:
[root@172-0-1-167 ~]# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.35.2000
Hardware version: 0
Node GUID: 0x1070fd030079cf64
System image GUID: 0x1070fd030079cf64
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0xa651e84a
Port GUID: 0x1070fd030079cf64
Link layer: InfiniBand
The ibv_devinfo command conveniently shows the port's link state and protocol. state: PORT_ACTIVE (4) means the cable is connected; state: PORT_DOWN (1) would mean it is not. link_layer: InfiniBand means the NIC is running in IB mode.
ibstat // the card is healthy when the port State is Active and the Link layer is InfiniBand
List all nodes in the subnet:
[root@172-0-1-167 ~]# ibnodes
Ca : 0xe8ebd30300a38376 ports 1 "vnet HCA-1"
Ca : 0x1070fd030079cf64 ports 1 "172-0-1-167 HCA-1"
OFED provides the iblinkinfo command, which makes it easy to view the IB fabric topology.
[root@172-0-1-167 ~]# iblinkinfo
CA: vnet HCA-1:
0xe8ebd30300a38376 2 1[ ] ==( 4X 25.78125 Gbps Active/ LinkUp)==> 1 1[ ] "172-0-1-167 HCA-1" ( )
CA: 172-0-1-167 HCA-1:
0x1070fd030079cf64 1 1[ ] ==( 4X 25.78125 Gbps Active/ LinkUp)==> 2 1[ ] "vnet HCA-1" ( )
From the output above we can see two IB nodes in the subnet, one per host (vnet and 172-0-1-167), linked directly to each other at 4X 25.78125 Gbps.
Mode 1: IB
Connectivity test with ibping
Start the ibping server on node 1:
[root@172-0-1-167 ~]# ibping -S -C mlx5_0
----> There is no output here; the command keeps running.
----> Explanation: -S runs ibping as a server
-C is the CA name, taken from the ibstat output
-P is the port number, taken from the ibstat output (port: 1)
Start the client on node 2:
ibping -f -C mlx5_0 -L 1 -c 10
---> Explanation: -c 10 means stop after sending 10 packets.
-f floods the destination
-C is the CA name, taken from the ibstat output
-P is the port number, matching the -P value used when starting ibping on the server side.
-L is the Base lid of the server side (see ibstat on the server).
Performance test
So far I have not found a way to benchmark native IB directly.
Mode 2: IP over IB
ifconfig shows the warning "Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8)". Discussions online say this is only a display issue in ifconfig and does not affect usage; I have not verified it further (see the note after the output below).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 00:00:10:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 664 bytes 79636 (77.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 563 bytes 46668 (45.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ib1: flags=4099<UP,BROADCAST,MULTICAST> mtu 4092
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 00:00:11:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
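The warning itself comes from ifconfig: an IPoIB link-layer address is 20 bytes long, more than ifconfig can render reliably, which is what the BUGS section of ifconfig(8) refers to. The iproute2 tools print the full address:
# show the complete 20-byte IPoIB link address
ip -d link show ib0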
# Configure temporary IPs
# Node 1:
ifconfig ib0 192.168.1.1/24
# Set the default gateway to the local IP
route add default gw 192.168.1.1 dev ib0
# Node 2:
ifconfig ibs6f0 192.168.1.2/24
# Set the default gateway to node 1's IP
route add default gw 192.168.1.1 dev ibs6f0
# Note: after adding this gateway the node may lose Internet access; if you need Internet access, delete the newly added gateway.
# To delete the gateway:
# route del default gw 192.168.1.1 dev ibs6f0
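To make the address survive a reboot on Ubuntu 20.04, a netplan sketch could be used instead of ifconfig (the file name, and treating ib0 like a plain ethernet device, are my assumptions; verify with netplan try):
cat > /etc/netplan/60-ipoib.yaml <<'EOF'
network:
  version: 2
  ethernets:
    ib0:
      addresses: [192.168.1.1/24]
EOF
netplan apply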
Test 1: connectivity
Run on 192.168.1.1 (node 1):
[root@localhost ~]# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.040 ms
Test 2: performance
The mlnx_tune tool shipped with OFED can automatically check for performance bottlenecks:
[root@localhost ~]# mlnx_tune -r
Tune the network for high throughput:
mlnx_tune -p HIGH_THROUGHPUT
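The client run below assumes an iperf3 server was already listening on node 1, presumably started with:
# on 192.168.1.1
iperf3 -s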
root@vnet:~# iperf3 -c 192.168.1.1 -i 5
Connecting to host 192.168.1.1, port 5201
[ 5] local 192.168.1.2 port 53830 connected to 192.168.1.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-5.00 sec 29.6 GBytes 50.8 Gbits/sec 0 803 KBytes
[ 5] 5.00-10.00 sec 32.3 GBytes 55.5 Gbits/sec 0 803 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 61.9 GBytes 53.2 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 61.9 GBytes 53.2 Gbits/sec receiver
iperf3 only reaches about 30-60 Gb/s. This is because iperf3 is single-threaded, so a single process has a hard time saturating 100 Gb/s.
Reference: iperf3 at 40Gbps and above.
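A common workaround for iperf3's single-threaded design (a widely used pattern, not taken from this article) is to run several client/server pairs on different ports and add up the results, e.g.:
# node 1: one listener per port
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
# node 2: one client per port
iperf3 -c 192.168.1.1 -p 5201 &
iperf3 -c 192.168.1.1 -p 5202 &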
Switch to iperf (iperf2), which can run multiple threads with -P:
root@vnet:~# iperf -c 192.168.1.1 -P 4
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[ 7] local 192.168.1.2 port 34322 connected with 192.168.1.1 port 5001
[ 3] local 192.168.1.2 port 34294 connected with 192.168.1.1 port 5001
[ 5] local 192.168.1.2 port 34310 connected with 192.168.1.1 port 5001
[ 6] local 192.168.1.2 port 34320 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 7] 0.0-10.0 sec 27.0 GBytes 23.2 Gbits/sec
[ 3] 0.0-10.0 sec 27.0 GBytes 23.2 Gbits/sec
[ 5] 0.0-10.0 sec 27.0 GBytes 23.2 Gbits/sec
[ 6] 0.0-10.0 sec 27.0 GBytes 23.2 Gbits/sec
[SUM] 0.0-10.0 sec 108 GBytes 92.9 Gbits/sec
iperf measures TCP throughput at 92.9 Gbits/sec, close to 100 Gb/s but still short of it; I have not found the bottleneck yet.
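One factor worth checking (an assumption I did not verify here) is the IPoIB mode and MTU: ib0 came up in datagram mode with an MTU of 2044, while connected mode allows an MTU of up to 65520, which often helps TCP throughput. Note that connected mode may be unavailable when the driver uses enhanced IPoIB:
# on both nodes, interface name ib0 assumed
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520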
Test 3: qperf
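qperf also needs a listener on the peer; presumably it was started on 192.168.1.1 with no arguments, which runs it in server mode:
# on 192.168.1.1
qperf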
root@vnet:~# qperf 192.168.1.1 ud_lat ud_bw rc_rdma_read_bw rc_rdma_write_bw uc_rdma_write_bw tcp_bw tcp_lat udp_bw udp_lat
ud_lat:
latency = 4.15 us
ud_bw:
send_bw = 10.1 GB/sec
recv_bw = 10.1 GB/sec
rc_rdma_read_bw:
bw = 12 GB/sec
rc_rdma_write_bw:
bw = 12 GB/sec
uc_rdma_write_bw:
send_bw = 12 GB/sec
recv_bw = 12 GB/sec
tcp_bw:
bw = 4.45 GB/sec
tcp_lat:
latency = 8.36 us
udp_bw:
send_bw = 2.86 GB/sec
recv_bw = 2.86 GB/sec
udp_lat:
latency = 7.35 us
These are the performance figures for the various transports:
- ud_lat: latency of sending/receiving messages over UD (Unreliable Datagram) is 4.15 µs.
- ud_bw: bandwidth over UD is 10.1 GB/s in each direction, about 80 Gb/s.
- rc_rdma_read_bw: RDMA read bandwidth over RC (Reliable Connection) is 12 GB/s, about 96 Gb/s.
- rc_rdma_write_bw: RDMA write bandwidth over RC is 12 GB/s.
- uc_rdma_write_bw: RDMA write bandwidth over UC (Unreliable Connection) is 12 GB/s.
- tcp_bw: TCP bandwidth is 4.45 GB/s, about 35.6 Gb/s.
- tcp_lat: TCP latency is 8.36 µs.
- udp_bw: UDP bandwidth is 2.86 GB/s, about 22.88 Gb/s.
- udp_lat: UDP latency is 7.35 µs.
Test 4: ib_send_bw
Start on one end:
ib_send_bw -d mlx5_0
Connect from the other end via IP:
root@vnet:~# ib_send_bw -d mlx5_0 192.168.1.1
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x0058 PSN 0x4959ba
remote address: LID 0x01 QPN 0x0057 PSN 0x658b82
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 11497.50 11497.00 0.183952
---------------------------------------------------------------------------------------
In this test 1000 messages of 65536 bytes each were sent. The peak bandwidth was 11497.50 MB/s and the average 11497.00 MB/s, about 92 Gb/s. The message rate was 0.183952 Mpps, i.e. roughly 184,000 messages of 65536 bytes per second.
Test 5: ib_send_lat latency test
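As with ib_send_bw, the server side is started first on the peer and the client then connects to it by IP:
# on the peer
ib_send_lat -d mlx5_0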
root@vnet:~# ib_send_lat -d mlx5_0 6.6.6.6
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 236[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x0059 PSN 0xec24c
remote address: LID 0x01 QPN 0x0058 PSN 0xd1a3f5
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 0.90 2.47 0.94 0.95 0.04 0.98 2.47
---------------------------------------------------------------------------------------
In this test 1000 messages of 2 bytes each were sent: minimum latency 0.90 µs, maximum 2.47 µs, typical 0.94 µs, average 0.95 µs, standard deviation 0.04 µs. 99% of the messages had a latency of at most 0.98 µs, and 99.9% at most 2.47 µs.
Next we switch the NIC to Ethernet mode.
Start mst
mst makes it easier to manage the NIC and view its information.
root@vnet:~# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
root@vnet:~# mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:ca:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
Change the NIC link type to Ethernet (mode 2)
Check the current link type:
root@vnet:~# mlxconfig -d /dev/mst/mt4119_pciconf0 q |grep LINK_TYPE_P
LINK_TYPE_P1 IB(1)
LINK_TYPE_P2 IB(1)
Here (1) means the ports are in IB mode.
root@vnet:~# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2
root@vnet:~# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P2=2
Device #1:
----------
Device type: ConnectX5
Name: MCX556A-ECA_Ax
Description: ConnectX-5 VPI adapter card; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
Device: /dev/mst/mt4119_pciconf0
Configurations: Next Boot New
LINK_TYPE_P1 IB(1) ETH(2)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Note that the argument after -d is the device path shown by mst status, /dev/mst/mt4119_pciconf0. set LINK_TYPE_P1=2 puts the port into mode 2 (ETH, Ethernet); mode 1 is IB.
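To switch back to IB mode later, the same command with value 1 should work (followed by another reboot); this is just the inverse of the step above:
mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=1 LINK_TYPE_P2=1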
Reboot the machine.
After the Ubuntu system reboots, it is not obvious from ip a which interface name now belongs to the card:
root@vnet:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno8303: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether b4:45:06:ee:42:1d brd ff:ff:ff:ff:ff:ff
3: eno12399: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:fe:54:60:e1:50 brd ff:ff:ff:ff:ff:ff
4: ens6f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e8:eb:d3:a3:83:76 brd ff:ff:ff:ff:ff:ff
5: eno12409: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:fe:54:60:e1:51 brd ff:ff:ff:ff:ff:ff
6: eno8403: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b4:45:06:ee:42:1e brd ff:ff:ff:ff:ff:ff
inet 172.0.1.168/24 brd 172.0.1.255 scope global eno8403
valid_lft forever preferred_lft forever
inet6 fe80::b645:6ff:feee:421e/64 scope link
valid_lft forever preferred_lft forever
7: idrac: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b4:45:06:ef:11:82 brd ff:ff:ff:ff:ff:ff
8: ibs6f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 00:00:05:c7:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:a3:83:77 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
Above, ibs6f1 is the card's other port; the original ibs6f0 is gone.
Find the corresponding interface via the PCI address:
root@vnet:~# lspci |grep Mellanox
98:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
98:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
root@vnet:~#
root@vnet:~#
root@vnet:~#
root@vnet:~# lshw -class network -businfo
Bus info Device Class Description
=======================================================
pci@0000:04:00.0 eno8303 network NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
pci@0000:04:00.1 eno8403 network NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
pci@0000:31:00.0 eno12399 network Ethernet Controller X710 for 10GbE SFP+
pci@0000:31:00.1 eno12409 network Ethernet Controller X710 for 10GbE SFP+
pci@0000:98:00.0 ens6f0np0 network MT27800 Family [ConnectX-5]
pci@0000:98:00.1 ibs6f1 network MT27800 Family [ConnectX-5]
usb@1:14.3 idrac network Ethernet interface
From the lshw output, pci@0000:98:00.0 now corresponds to the interface ens6f0np0; the name changed and is no longer ibs6f0.
At this point ibstat on both hosts shows the port as Down:
root@vnet:~# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.35.2000
Hardware version: 0
Node GUID: 0xe8ebd30300a38376
System image GUID: 0xe8ebd30300a38376
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffea38376
Link layer: Ethernet
[root@172-0-1-167 ~]# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.35.2000
Hardware version: 0
Node GUID: 0x1070fd030079cf64
System image GUID: 0x1070fd030079cf64
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 1
LMC: 0
SM lid: 2
Capability mask: 0xa651e84a
Port GUID: 0x1070fd030079cf64
Link layer: InfiniBand
Temporarily configuring the original IP addresses does not give connectivity (the other side is still in IB mode):
ifconfig ib0 192.168.1.1/24
ifconfig ens6f0np0 192.168.1.2/24 up
root@vnet:~# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
From 192.168.1.2 icmp_seq=2 Destination Host Unreachable
From 192.168.1.2 icmp_seq=3 Destination Host Unreachable
In the same way, the other side also needs to be switched to Ethernet mode. In Ethernet mode it works without starting opensm (no subnet manager is required):
[root@172-0-1-167 ~]# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.35.2000
Hardware version: 0
Node GUID: 0x1070fd030079cf64
System image GUID: 0x1070fd030079cf64
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x1270fdfffe79cf64
Link layer: Ethernet
Above, State: Active and Physical state: LinkUp show the port is up, but because the Link layer is Ethernet the Base lid is 0, so LID-based communication is not possible and ibping no longer works.
[root@172-0-1-167 ~]# ibdump -w Ethernet.pcap
Initiating resources ...
searching for IB devices in host
Port active_mtu=1024
MR was registered with addr=0x1db4840, lkey=0x500f, rkey=0x500f, flags=0x1
------------------------------------------------
Device : "mlx5_0"
Physical port : 1
Link layer : Ethernet
Dump file : Ethernet.pcap
Sniffer WQEs (max burst size) : 4096
------------------------------------------------
Failed to set port sniffer1: command interface bad params
In Ethernet mode ibdump no longer works, apparently because the sniffer is not available; to capture packets, use tcpdump instead, via Docker, as follows:
No sniffer flag when using ethtool --show-priv-flags
[Network] TCP capture | RDMA capture | ibdump and tcpdump usage - bandaoyu's blog - CSDN
See section [2. tcpdump (Docker, Linux kernel 4.9+)] there.
docker pull mellanox/tcpdump-rdma
docker run -it -v /dev/infiniband:/dev/infiniband -v /tmp/traces:/tmp/traces --net=host --privileged mellanox/tcpdump-rdma bash
tcpdump -i mlx5_0 -s 0 -w /tmp/traces/capture1.pcap
Server : ib_write_bw -d mlx5_0 -n 100 -R
Client: ib_write_bw -d mlx5_0 -n 100 -R 6.6.6.6
Locate the capture file with find:
find / -name capture1.pcap
The default is RoCE v2:
:~# cma_roce_mode -d mlx5_0
RoCE v2
-m 1 selects RoCE v1, -m 2 selects RoCE v2:
# cma_roce_mode -d mlx5_0 -m 1
IB/RoCE v1
Test: iperf
root@vnet:~# iperf -c 192.168.1.1 -P 4
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.2 port 37344 connected with 192.168.1.1 port 5001
[ 4] local 192.168.1.2 port 37318 connected with 192.168.1.1 port 5001
[ 5] local 192.168.1.2 port 37322 connected with 192.168.1.1 port 5001
[ 6] local 192.168.1.2 port 37328 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 25.9 GBytes 22.2 Gbits/sec
[ 4] 0.0-10.0 sec 25.8 GBytes 22.1 Gbits/sec
[ 5] 0.0-10.0 sec 28.9 GBytes 24.9 Gbits/sec
[ 6] 0.0-10.0 sec 29.0 GBytes 24.9 Gbits/sec
[SUM] 0.0-10.0 sec 110 GBytes 94.1 Gbits/sec
Here iperf reaches 94.1 Gbits/sec, slightly higher than the 92.9 Gbits/sec measured in IPoIB mode.
CPU usage is very high:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10645 root 20 0 393856 5092 1868 S 251.2 0.0 1:10.49 iperf
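For comparison, an RDMA-level test in Ethernet (RoCE) mode bypasses the kernel TCP stack and should use far less CPU. Since LIDs are gone, the perftest tools have to use rdma_cm (-R) and connect by IP, e.g. (a sketch reusing the addresses above):
# node 1 (server)
ib_send_bw -d mlx5_0 -R
# node 2 (client)
ib_send_bw -d mlx5_0 -R 192.168.1.1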