Ubuntu 20.04 (Linux 5.4.0-81-generic): installing and debugging the driver for an Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Reference 1: Home 10-gigabit networking guide, part 1 - Why not start with the simplest 100G network - Zhihu (zhihu.com)

Reference 2: Installing and debugging the InfiniBand NIC (MT27500 Family [ConnectX-3]) driver on CentOS 7.9 (kernel 3.10.0-1160.83.1.el7.x86_64) - Zhihu (zhihu.com)

Reference 3: FAQ - common IB commands - Huawei (huawei.com)

System and kernel versions

root@vnet:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
root@vnet:~# uname -rs
Linux 5.4.0-81-generic

Insert the NIC, reboot, and check that the InfiniBand controller is visible

lspci |grep Infiniband
ca:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
ca:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Download the driver from the official site: NVIDIA InfiniBand Software | NVIDIA

MLNX_OFEDv5.9-0.5.6.0

Note: By downloading and installing MLNX_OFED package for Oracle Linux (OL) OS, you may be violating your operating system’s support matrix. Please consult with your operating system support before installing.

Note: MLNX_OFED 4.9-x LTS should be used by customers who would like to utilize one of the following:

  • NVIDIA ConnectX-3 Pro
  • NVIDIA ConnectX-3
  • NVIDIA Connect-IB
  • RDMA experimental verbs library (mlnx_lib)
  • OSs based on kernel version lower than 3.10

Note: All of the above are not available on MLNX_OFED 5.x branch.

Note: MLNX_OFED 5.4/5.8-x LTS should be used by customers who would like to utilize NVIDIA ConnectX-4 onwards adapter cards and keep using stable 5.4/5.8-x deployment and get:

  • Critical bug fixes
  • Support for new major OSs

Download it via the web page, then copy it to the Linux host and extract it.

Verify the sha256 checksum
root@vnet:~# sha256sum MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
99aa2966ce260f3ca282e24a26c6f5302692f9072117626107aa599868208d8f  MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
Extract
tar -zxf MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
Install
cd MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64/
./mlnxofedinstall
Part of the installation log is omitted here
Answer y when prompted

A prompt in red means the installer is automatically trying to install some missing dependencies for you; just wait while it handles this.

When the installation finishes, it prompts you to run /etc/init.d/openibd restart to load the new driver.

Start openibd

openibd is the daemon the NIC needs; it loads the required kernel modules. We start it and enable it at boot.

root@vnet:~/MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64# /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading HCA driver and Access Layer: [ OK ]

Check its status

systemctl status openibd
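
To also have it start automatically at boot, as mentioned above, the service can be enabled (a hedged sketch: recent MLNX_OFED packages register openibd with systemd, which the systemctl status output above suggests; for a plain SysV init script, the update-rc.d line below is the Ubuntu equivalent):

systemctl enable openibd
# or, for the SysV init script:
# update-rc.d openibd defaults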

Use the ibv_devinfo command to view the device information

[root@172-0-1-167 ~]# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.2000
        node_guid:                      1070:fd03:0079:cf64
        sys_image_guid:                 1070:fd03:0079:cf64
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000008
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

Alternatively, check the IB state with ibstat
[root@172-0-1-167 ~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.35.2000
        Hardware version: 0
        Node GUID: 0x1070fd030079cf64
        System image GUID: 0x1070fd030079cf64
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0xa651e84a
                Port GUID: 0x1070fd030079cf64
                Link layer: InfiniBand

ibv_devinfo conveniently shows the port's link state and protocol. state: PORT_ACTIVE (4) means the cable is connected and the link is up; state: PORT_DOWN (1) means the cable is not properly connected. link_layer: InfiniBand means the card is running in IB mode.

ibstat // the card is working normally if State is Active and Link layer is InfiniBand
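
A quick one-liner to check just these two fields on each host:

ibv_devinfo | grep -E 'state|link_layer'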

List all the nodes in the subnet

[root@172-0-1-167 ~]# ibnodes
Ca      : 0xe8ebd30300a38376 ports 1 "vnet HCA-1"
Ca      : 0x1070fd030079cf64 ports 1 "172-0-1-167 HCA-1"

OFED also provides the iblinkinfo command, which gives a convenient view of the IB fabric topology.

[root@172-0-1-167 ~]# iblinkinfo
CA: vnet HCA-1:
      0xe8ebd30300a38376      2    1[  ] ==( 4X      25.78125 Gbps Active/  LinkUp)==>       1    1[  ] "172-0-1-167 HCA-1" ( )
CA: 172-0-1-167 HCA-1:
      0x1070fd030079cf64      1    1[  ] ==( 4X      25.78125 Gbps Active/  LinkUp)==>       2    1[  ] "vnet HCA-1" ( )

The output above shows that the subnet contains two IB nodes (one HCA on each host), directly connected to each other.

Mode 1: IB

Connectivity test with ibping

Start the ibping server on node 1
[root@172-0-1-167 ~]# ibping -S -C mlx5_0
----> This call does not return; it keeps running.
----> Explanation: -S runs ibping in server mode
         -C is the CA name, from the ibstat output
         -P is the port number, from the ibstat output (port: 1)

Start the client on node 2
ibping -f -C mlx5_0 -L 1 -c 10
---> Explanation: -c 10 means stop after sending 10 packets.
          -f floods the destination
          -C is the CA name, from the ibstat output
          -P is the port number, matching the -P value given when starting the ibping server.
          -L is the Base lid of the server (see its ibstat output)
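
If you want to pull the server's Base lid for the client's -L option automatically, a small sketch (assuming CA mlx5_0, port 1, as above):

# on the server node: print the Base lid the client should pass to -L
ibstat mlx5_0 1 | awk '/Base lid/{print $3}'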

Performance testing

I have not yet found a good way to benchmark native IB performance.

Mode 2: IP over IB (IPoIB)

Looking at ifconfig, there is a warning: "Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8)". According to discussions online this is just a display issue in ifconfig and does not affect usage; I have not verified it further.

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 00:00:10:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 664  bytes 79636 (77.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 563  bytes 46668 (45.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 4092
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 00:00:11:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

#Configure temporary IP addresses
#On node 1:
ifconfig ib0 192.168.1.1/24
#Set the default gateway to the local IP
route add default gw 192.168.1.1 dev ib0
#On node 2:
ifconfig ibs6f0 192.168.1.2/24
#Set the default gateway to node 1's IP
route add default gw 192.168.1.1 dev ibs6f0

#Note: after adding this default route the node may lose Internet access; if Internet access is needed, delete the new default route.
#How to delete it:
#route del default gw 192.168.1.1 dev ibs6f0
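
For two directly connected hosts, the connected /24 route created by the address assignment is already enough, so the default route above can be skipped entirely. If traffic to some other subnet should go over the IB link, a more specific route avoids touching the default gateway (a sketch; 192.168.2.0/24 is a hypothetical remote subnet):

# on node 2: reach a hypothetical remote subnet via node 1 without changing the default route
ip route add 192.168.2.0/24 via 192.168.1.1 dev ibs6f0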

Test 1: connectivity

Run on 192.168.1.1
[root@localhost ~]# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.040 ms

Test 2: performance

The mlnx_tune tool shipped with OFED can automatically check for performance bottlenecks:

[root@localhost ~]# mlnx_tune -r

Tune the network:

mlnx_tune -p HIGH_THROUGHPUT
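
For the iperf3 run below, the server is assumed to already be listening on node 1 (a minimal sketch; iperf3 uses TCP port 5201 by default):

# on node 1 (192.168.1.1)
iperf3 -s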
root@vnet:~# iperf3 -c 192.168.1.1 -i 5
Connecting to host 192.168.1.1, port 5201
[  5] local 192.168.1.2 port 53830 connected to 192.168.1.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-5.00   sec  29.6 GBytes  50.8 Gbits/sec    0    803 KBytes
[  5]   5.00-10.00  sec  32.3 GBytes  55.5 Gbits/sec    0    803 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  61.9 GBytes  53.2 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  61.9 GBytes  53.2 Gbits/sec                  receiver

iperf3 only reaches about 30-60 Gb/s; this is because iperf3 is single-threaded, and a single core can hardly drive 100 Gb/s.

Reference: iperf3 at 40Gbps and above

Switch to iperf (iperf2) with 4 parallel streams instead
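
The iperf server side is again assumed to be running on node 1 first, on iperf's default TCP port 5001:

# on node 1 (192.168.1.1)
iperf -s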

root@vnet:~# iperf -c 192.168.1.1   -P 4
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[  7] local 192.168.1.2 port 34322 connected with 192.168.1.1 port 5001
[  3] local 192.168.1.2 port 34294 connected with 192.168.1.1 port 5001
[  5] local 192.168.1.2 port 34310 connected with 192.168.1.1 port 5001
[  6] local 192.168.1.2 port 34320 connected with 192.168.1.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  27.0 GBytes  23.2 Gbits/sec
[  3]  0.0-10.0 sec  27.0 GBytes  23.2 Gbits/sec
[  5]  0.0-10.0 sec  27.0 GBytes  23.2 Gbits/sec
[  6]  0.0-10.0 sec  27.0 GBytes  23.2 Gbits/sec
[SUM]  0.0-10.0 sec   108 GBytes  92.9 Gbits/sec

The iperf TCP result is 92.9 Gbits/sec, close to 100 Gb/s but still somewhat short of it; I have not identified the remaining bottleneck.

Test 3: qperf
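
qperf needs a server instance on the other node; it is assumed to have been started there simply by running qperf with no arguments (a minimal sketch):

# on node 1 (192.168.1.1): start the qperf server and leave it running
qperf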

root@vnet:~# qperf 192.168.1.1 ud_lat ud_bw rc_rdma_read_bw rc_rdma_write_bw uc_rdma_write_bw tcp_bw tcp_lat udp_bw udp_lat
ud_lat:
    latency  =  4.15 us
ud_bw:
    send_bw  =  10.1 GB/sec
    recv_bw  =  10.1 GB/sec
rc_rdma_read_bw:
    bw  =  12 GB/sec
rc_rdma_write_bw:
    bw  =  12 GB/sec
uc_rdma_write_bw:
    send_bw  =  12 GB/sec
    recv_bw  =  12 GB/sec
tcp_bw:
    bw  =  4.45 GB/sec
tcp_lat:
    latency  =  8.36 us
udp_bw:
    send_bw  =  2.86 GB/sec
    recv_bw  =  2.86 GB/sec
udp_lat:
    latency  =  7.35 us

These are the performance figures for each transport:

  • ud_lat: latency of sending and receiving messages over UD (Unreliable Datagram) is 4.15 µs.

  • ud_bw: bandwidth over UD is 10.1 GB/sec in each direction, roughly 80 Gb/s.

  • rc_rdma_read_bw: RDMA read bandwidth over RC (Reliable Connection) is 12 GB/sec, roughly 96 Gb/s.

  • rc_rdma_write_bw: RDMA write bandwidth over RC is 12 GB/sec.

  • uc_rdma_write_bw: RDMA write bandwidth over UC (Unreliable Connection) is 12 GB/sec.

  • tcp_bw: TCP send/receive bandwidth is 4.45 GB/sec, roughly 35.6 Gb/s.

  • tcp_lat: TCP send/receive latency is 8.36 µs.

  • udp_bw: UDP send/receive bandwidth is 2.86 GB/sec, roughly 22.88 Gb/s.

  • udp_lat: UDP send/receive latency is 7.35 µs.

Test 4: ib_send_bw

Start the server on one end
ib_send_bw -d mlx5_0
Connect from the other end via its IP
root@vnet:~# ib_send_bw -d mlx5_0 192.168.1.1
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0058 PSN 0x4959ba
 remote address: LID 0x01 QPN 0x0057 PSN 0x658b82
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             11497.50            11497.00                  0.183952
---------------------------------------------------------------------------------------

In this test 1000 messages of 65536 bytes each were sent. The peak bandwidth is 11497.50 MB/s and the average is 11497.00 MB/s, roughly 92 Gb/s. The message rate is 0.183952 Mpps, i.e. about 184,000 messages of 65536 bytes per second.

Test 5: ib_send_lat (latency test)
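
As with ib_send_bw, the server side is assumed to have been started on the destination host first (6.6.6.6 is the peer's address in this setup):

# on the destination host
ib_send_lat -d mlx5_0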

root@vnet:~# ib_send_lat -d mlx5_0 6.6.6.6
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 236[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0059 PSN 0xec24c
 remote address: LID 0x01 QPN 0x0058 PSN 0xd1a3f5
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       1000          0.90           2.47         0.94                0.95             0.04            0.98                    2.47
---------------------------------------------------------------------------------------

In this test 1000 messages of 2 bytes each were sent. The minimum latency is 0.90 µs, the maximum 2.47 µs, the typical latency 0.94 µs and the average 0.95 µs, with a standard deviation of 0.04 µs. 99% of the messages completed within 0.98 µs and 99.9% within 2.47 µs.

Next we switch the NIC to Ethernet mode.

Start mst

mst (Mellanox Software Tools) makes it easier to manage the NIC and inspect its information.

root@vnet:~# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
root@vnet:~# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:ca:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

Change the NIC link type to Ethernet (mode 2)

Check the current link type
root@vnet:~# mlxconfig -d /dev/mst/mt4119_pciconf0 q |grep  LINK_TYPE_P
         LINK_TYPE_P1                                IB(1)
         LINK_TYPE_P2                                IB(1)         
The value IB(1) means the ports are currently in IB mode

root@vnet:~# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2
root@vnet:~# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P2=2

Device #1:
----------

Device type:    ConnectX5
Name:           MCX556A-ECA_Ax
Description:    ConnectX-5 VPI adapter card; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
Device:         /dev/mst/mt4119_pciconf0

Configurations:                                      Next Boot       New
         LINK_TYPE_P1                                IB(1)           ETH(2)

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

Note that the argument after -d is the device path reported by mst status, /dev/mst/mt4119_pciconf0. set LINK_TYPE_P1=2 switches the port to mode 2 (ETH, Ethernet); mode 1 is IB.

Reboot the machine
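
After the machine comes back up, the link type change can be confirmed by repeating the earlier query (mst has to be started again after a reboot); both ports should now report ETH(2):

mst start
mlxconfig -d /dev/mst/mt4119_pciconf0 q |grep LINK_TYPE_P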

After the reboot, it is not obvious from ip a which interface name now belongs to the card.

root@vnet:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno8303: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether b4:45:06:ee:42:1d brd ff:ff:ff:ff:ff:ff
3: eno12399: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:60:e1:50 brd ff:ff:ff:ff:ff:ff
4: ens6f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e8:eb:d3:a3:83:76 brd ff:ff:ff:ff:ff:ff
5: eno12409: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:60:e1:51 brd ff:ff:ff:ff:ff:ff
6: eno8403: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b4:45:06:ee:42:1e brd ff:ff:ff:ff:ff:ff
    inet 172.0.1.168/24 brd 172.0.1.255 scope global eno8403
       valid_lft forever preferred_lft forever
    inet6 fe80::b645:6ff:feee:421e/64 scope link
       valid_lft forever preferred_lft forever
7: idrac: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b4:45:06:ef:11:82 brd ff:ff:ff:ff:ff:ff
8: ibs6f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:05:c7:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:a3:83:77 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

In the output above, ibs6f1 is the card's other port; the original ibs6f0 is gone.

Find the corresponding interface via its PCI address

root@vnet:~# lspci |grep Mellanox
98:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
98:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
root@vnet:~#
root@vnet:~#
root@vnet:~#
root@vnet:~# lshw -class network -businfo
Bus info          Device     Class          Description
=======================================================
pci@0000:04:00.0  eno8303    network        NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
pci@0000:04:00.1  eno8403    network        NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
pci@0000:31:00.0  eno12399   network        Ethernet Controller X710 for 10GbE SFP+
pci@0000:31:00.1  eno12409   network        Ethernet Controller X710 for 10GbE SFP+
pci@0000:98:00.0  ens6f0np0  network        MT27800 Family [ConnectX-5]
pci@0000:98:00.1  ibs6f1     network        MT27800 Family [ConnectX-5]
usb@1:14.3        idrac      network        Ethernet interface

The lshw output shows that pci@0000:98:00.0 now corresponds to the interface ens6f0np0; the name has changed and is no longer ibs6f0.
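
Another way to confirm the mapping is to ask the interface for its PCI bus info directly (assuming ethtool is available):

ethtool -i ens6f0np0 | grep bus-info
# should print bus-info: 0000:98:00.0, matching the lshw output above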

At this point ibstat on both hosts shows the port State as Down:

root@vnet:~# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.35.2000
        Hardware version: 0
        Node GUID: 0xe8ebd30300a38376
        System image GUID: 0xe8ebd30300a38376
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffea38376
                Link layer: Ethernet
                
[root@172-0-1-167 ~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.35.2000
        Hardware version: 0
        Node GUID: 0x1070fd030079cf64
        System image GUID: 0x1070fd030079cf64
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 1
                LMC: 0
                SM lid: 2
                Capability mask: 0xa651e84a
                Port GUID: 0x1070fd030079cf64
                Link layer: InfiniBand                

Temporarily configuring the original IP addresses does not give connectivity, because the other node is still in IB mode:

ifconfig ib0 192.168.1.1/24
ifconfig ens6f0np0 192.168.1.2/24 up

root@vnet:~# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
From 192.168.1.2 icmp_seq=2 Destination Host Unreachable
From 192.168.1.2 icmp_seq=3 Destination Host Unreachable

The other node needs to be switched to Ethernet mode in the same way. It works without starting opensm:

[root@172-0-1-167 ~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.35.2000
        Hardware version: 0
        Node GUID: 0x1070fd030079cf64
        System image GUID: 0x1070fd030079cf64
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x1270fdfffe79cf64
                Link layer: Ethernet

Above, State: Active and Physical state: LinkUp mean the port is up, but because Link layer is Ethernet, Base lid is 0: LID-based communication is not possible and ibping no longer works.

[root@172-0-1-167 ~]# ibdump -w Ethernet.pcap
Initiating resources ...
searching for IB devices in host
Port active_mtu=1024
MR was registered with addr=0x1db4840, lkey=0x500f, rkey=0x500f, flags=0x1
 ------------------------------------------------
 Device                         : "mlx5_0"
 Physical port                  : 1
 Link layer                     : Ethernet
 Dump file                      : Ethernet.pcap
 Sniffer WQEs (max burst size)  : 4096
 ------------------------------------------------

Failed to set port sniffer1: command interface bad params

In Ethernet mode ibdump no longer works, presumably because the sniffer capability is gone. Packet capture now needs tcpdump, run through Docker, as described below:

No sniffer flag when using ethtool --show-priv-flags

Reference: [Networking] TCP capture | RDMA capture | ibdump and tcpdump usage - bandaoyu's blog (CSDN)

See section 2 there: tcpdump (Docker, requires Linux kernel 4.9 or later)

docker pull mellanox/tcpdump-rdma

docker run -it -v /dev/infiniband:/dev/infiniband -v /tmp/traces:/tmp/traces --net=host --privileged mellanox/tcpdump-rdma bash

tcpdump -i mlx5_0 -s 0 -w /tmp/traces/capture1.pcap

Server : ib_write_bw -d mlx5_0 -n 100 -R 

Client: ib_write_bw -d mlx5_0 -n 100 -R  6.6.6.6

Find the capture file with find
find / -name capture1.pcap
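
The capture can then be read back with tcpdump (or opened in Wireshark):

tcpdump -nn -r /tmp/traces/capture1.pcap | head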

The default RoCE mode is RoCE v2

:~# cma_roce_mode -d mlx5_0
RoCE v2

-m 1 selects RoCE v1; -m 2 selects RoCE v2
# cma_roce_mode -d mlx5_0 -m 1   
IB/RoCE v1

Test: iperf

root@vnet:~# iperf -c 192.168.1.1   -P 4
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[  3] local 192.168.1.2 port 37344 connected with 192.168.1.1 port 5001
[  4] local 192.168.1.2 port 37318 connected with 192.168.1.1 port 5001
[  5] local 192.168.1.2 port 37322 connected with 192.168.1.1 port 5001
[  6] local 192.168.1.2 port 37328 connected with 192.168.1.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  25.9 GBytes  22.2 Gbits/sec
[  4]  0.0-10.0 sec  25.8 GBytes  22.1 Gbits/sec
[  5]  0.0-10.0 sec  28.9 GBytes  24.9 Gbits/sec
[  6]  0.0-10.0 sec  29.0 GBytes  24.9 Gbits/sec
[SUM]  0.0-10.0 sec   110 GBytes  94.1 Gbits/sec 

Here iperf reaches 94.1 Gbits/sec, slightly higher than the 92.9 Gbits/sec measured in IB (IPoIB) mode.

CPU usage is very high:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10645 root      20   0  393856   5092   1868 S 251.2  0.0   1:10.49 iperf
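
To see how that ~250% of CPU is spread across cores during the run, per-core usage can be sampled (assuming the sysstat package is installed):

mpstat -P ALL 1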