CX5 RDMA Network Testing on CentOS 7

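RDMA (Remote Direct Memory Access) was created to reduce the latency of server-side data processing during network transfers. It moves data over the network directly from one system into the memory of a remote system without involving the operating system on the data path, so very little of the host's processing power is needed. It eliminates extra memory copies and context switches, which frees memory bandwidth and CPU cycles for the application. RDMA requires support from the NIC; a Mellanox CX5 (ConnectX-5) is used here.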

Environment: CentOS 7.8 x86_64

1. Identify the CX5 NIC
#lspci |grep Mellanox
5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2. Install the MLNX driver

Download the driver package that matches your OS from: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/

Official documentation: https://enterprise-support.nvidia.com/s/article/howto-install-mlnx-ofed-driver

#Download MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64.tgz
#yum install createrepo
#yum install tcl fuse-libs tk
#tar zxvf MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64.tgz
#cd MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64
#./mlnxofedinstall  --add-kernel-support --with-nvmf --force    #--with-nvmf because NVMe-oF over RDMA will be tested later
#dracut -f
# /etc/init.d/openibd restart
Unloading HCA driver:                                      [  OK  ]
Loading HCA driver and Access Layer:                       [  OK  ]
#systemctl enable openibd
#reboot
# lsmod |grep -i nvme
nvme                   47306  8 
nvme_core              94686  5 nvme
mlx_compat             55285  13 nvme,rdma_cm,ib_cm,iw_cm,auxiliary,mlx5_ib,nvme_core,ib_core,ib_umad,ib_uverbs,mlx5_core,rdma_ucm,ib_ipoib
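
After the reboot it can be handy to confirm that every module you care about is actually loaded. The following is a minimal sketch, not part of MLNX_OFED: `missing_modules` is a hypothetical helper that reads `lsmod` output on stdin and prints the requested module names that are absent. Which modules to check is an assumption you should adjust to your setup.

```shell
# Print each requested module name that does not appear in lsmod output (read from stdin).
# The header row of lsmod is harmless because "Module" never matches a real module name.
missing_modules() {
    local loaded m
    loaded=$(awk '{ print $1 }')           # first column holds the module names
    for m in "$@"; do
        # -x: whole-line match, -F: treat the name as a fixed string
        printf '%s\n' "$loaded" | grep -qxF -- "$m" || echo "$m"
    done
    return 0
}
```

Example use: `lsmod | missing_modules mlx5_core mlx5_ib ib_core nvme`. Empty output means all listed modules are present.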
3. Check the devices
#ibdev2netdev
mlx5_0 port 1 ==> eth2 (Up)
mlx5_1 port 1 ==> eth3 (Up)
#ibstatus
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:1270:fdff:fe27:f982
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            25 Gb/sec (1X EDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:1270:fdff:fe27:f983
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            25 Gb/sec (1X EDR)
        link_layer:      Ethernet
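
The port check above can be scripted. This is a small sketch under the assumption that `ibstatus` keeps the layout shown above ("Infiniband device '<name>' port <n> status:" followed by a "state:" line); `inactive_ports` is a hypothetical helper name.

```shell
# Print "<device> port <n>: <state>" for every port whose state is not ACTIVE.
# Reads ibstatus-formatted text on stdin so it can also be run on saved output.
inactive_ports() {
    awk -v q="'" '
        /^Infiniband device/ {
            dev = $3
            gsub(q, "", dev)      # strip the single quotes around the device name
            port = $5
        }
        # "state:" is field 1 only on the logical state line, not on "phys state:"
        $1 == "state:" && $3 != "ACTIVE" {
            print dev " port " port ": " $3
        }
    '
}
```

Example use: `ibstatus | inactive_ports`. No output means every port reported ACTIVE.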

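Common commands

  • ibstat: query the basic status of InfiniBand devices
  • ibstatus: NIC port information
  • ibv_devinfo: NIC device information (ibv_devinfo -d mlx5_0 -v)
  • ibv_devices: list the InfiniBand devices on this host
  • ibnodes: list the InfiniBand devices in the network
  • show_gids: show which RoCE versions the NIC supports
  • show_counters: NIC port statistics, such as the amount of data sent and received
  • mlxconfig: NIC configuration (mlxconfig -d mlx5_1 q queries the NIC configuration)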

Throughput test

Write throughput

Installing the RDMA driver also installs a set of RDMA tools; ib_write_bw can be used to test write throughput.

Server A (server):

ib_write_bw -a -d mlx5_0

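Server B (client):

ib_write_bw -a -d mlx5_0 10.192.51.152    # IP address of the server side

Read throughput

The read throughput test works the same way as the write throughput test; just use ib_read_bw instead.

Latency test

The latency test is likewise split into read and write; the tools are ib_read_lat and ib_write_lat.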
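
The read and latency tools follow the same server/client pattern as ib_write_bw: start the tool without an address on the server, then start it with the server's IP on the client. The sketch below only builds the command line so it can be reviewed before running; `perftest_cmd` is a hypothetical helper, and the device name and IP are the ones used in this article.

```shell
# Build the perftest command line for one side of a test.
# Without a server address the command is the server side; with one it is the client side.
perftest_cmd() {
    local tool=$1 dev=$2 server=${3:-}
    case "$tool" in
        ib_write_bw|ib_read_bw|ib_write_lat|ib_read_lat) ;;
        *) echo "unsupported tool: $tool" >&2; return 1 ;;
    esac
    # ${server:+ $server} appends the address only when one was given
    echo "$tool -d $dev${server:+ $server}"
}
```

For example, `perftest_cmd ib_read_lat mlx5_0` prints the server-side command and `perftest_cmd ib_read_lat mlx5_0 10.192.51.152` prints the client-side one.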

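Bandwidth statistics

When using RDMA, an application can track the bandwidth it sends and receives itself, which gives a clear picture of how much data the program moves.
To find out how much data the RDMA NIC as a whole has sent and received, read the corresponding nodes under sysfs.

  • Data transmitted (raw counter, in 4-byte units): /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
  • Data received (raw counter, in 4-byte units): /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

The values of port_xmit_data and port_rcv_data are 1/4 of the actual byte count, so multiply them by 4 to get the real amount of data. This is presumably done to guard against counter overflow.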

port_xmit_data: (RO) Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is 64 bit counter
port_rcv_data: (RO) Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.

Source: Documentation/ABI/stable/sysfs-class-infiniband

pma_cnt_ext->port_xmit_data =
    cpu_to_be64(MLX5_SUM_CNT(out, transmitted_ib_unicast.octets,
                 transmitted_ib_multicast.octets) >> 2);
pma_cnt_ext->port_rcv_data =
    cpu_to_be64(MLX5_SUM_CNT(out, received_ib_unicast.octets,
                 received_ib_multicast.octets) >> 2);

file: drivers/infiniband/hw/mlx5/mad.c
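
To turn these counters into a bandwidth figure, read a counter twice, multiply the difference by 4 to get bytes, and divide by the elapsed time. The following is a rough sketch; `counter_to_mbps` and `sample_mbps` are hypothetical helper names, the default device and port are the ones used in this article, and the interval must be a whole number of seconds because shell arithmetic is integer-only.

```shell
# Convert two raw port_xmit_data / port_rcv_data readings into Mbit/s.
# The raw counter is in 4-byte units: bytes = delta * 4, bits = bytes * 8.
counter_to_mbps() {
    local prev=$1 cur=$2 seconds=$3
    echo $(( (cur - prev) * 4 * 8 / seconds / 1000000 ))
}

# Sample one counter twice and print the average rate over the interval.
# Arguments (defaults are illustrative): device, port, counter name, interval in seconds.
sample_mbps() {
    local dev=${1:-mlx5_0} port=${2:-1} name=${3:-port_xmit_data} interval=${4:-1}
    local path="/sys/class/infiniband/$dev/ports/$port/counters/$name"
    local prev cur
    prev=$(cat "$path") || return 1
    sleep "$interval"
    cur=$(cat "$path") || return 1
    counter_to_mbps "$prev" "$cur" "$interval"
}
```

Example use: `sample_mbps mlx5_0 1 port_rcv_data 5` prints the average receive rate in Mbit/s over five seconds. The result is truncated to an integer.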

Network connectivity test

Because the NIC currently supports only Ethernet mode, connectivity can only be tested with ibv_rc_pingpong.

  • https://community.mellanox.com/s/article/RoCE-Debug-Flow-for-Linux

Server A

# ibv_rc_pingpong -d mlx5_0 -g 0
  local address:  LID 0x0000, QPN 0x000088, PSN 0xf4799b, GID fe80::1270:fdff:fe27:f982
  remote address: LID 0x0000, QPN 0x000088, PSN 0x22fd0b, GID fe80::ac0:ebff:fef4:4bf4
8192000 bytes in 0.01 seconds = 6475.25 Mbit/sec
1000 iters in 0.01 seconds = 10.12 usec/iter

client B

# ibv_rc_pingpong -d mlx5_1 -g 0 10.192.51.152
  local address:  LID 0x0000, QPN 0x000088, PSN 0x22fd0b, GID fe80::ac0:ebff:fef4:4bf4
  remote address: LID 0x0000, QPN 0x000088, PSN 0xf4799b, GID fe80::1270:fdff:fe27:f982
8192000 bytes in 0.01 seconds = 6746.55 Mbit/sec
1000 iters in 0.01 seconds = 9.71 usec/iter

counters

# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/counters/
total 0
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 excessive_buffer_overrun_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 link_downed
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 link_error_recovery
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 local_link_integrity_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 multicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 multicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_constraint_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_data
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_remote_physical_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_switch_relay_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_constraint_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_data
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_discards
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_wait
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 symbol_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 unicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 unicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 VL15_dropped

Counter Description:

Counter | Description | InfiniBand Spec Name | Group
port_rcv_data | The total number of data octets, divided by 4 (counting in double words, 32 bits), received on all VLs from the port. | PortRcvData | Informative
port_rcv_packets | Total number of packets (this may include packets containing errors). This is a 64-bit counter. | PortRcvPkts | Informative
port_multicast_rcv_packets | Total number of multicast packets, including multicast packets containing errors. | PortMultiCastRcvPkts | Informative
port_unicast_rcv_packets | Total number of unicast packets, including unicast packets containing errors. | PortUnicastRcvPkts | Informative
port_xmit_data | The total number of data octets, divided by 4 (counting in double words, 32 bits), transmitted on all VLs from the port. | PortXmitData | Informative
port_xmit_packets, port_xmit_packets_64 | Total number of packets transmitted on all VLs from this port. This may include packets with errors. This is a 64-bit counter. | PortXmitPkts | Informative
port_rcv_switch_relay_errors | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. | PortRcvSwitchRelayErrors | Error
port_rcv_errors | Total number of packets containing an error that were received on the port. | PortRcvErrors | Informative
port_rcv_constraint_errors | Total number of packets received on the switch physical port that are discarded. | PortRcvConstraintErrors | Error
local_link_integrity_errors | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors. | LocalLinkIntegrityErrors | Error
port_xmit_wait | The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | PortXmitWait | Informative
port_multicast_xmit_packets | Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors. | PortMultiCastXmitPkts | Informative
port_unicast_xmit_packets | Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors. | PortUnicastXmitPkts | Informative
port_xmit_discards | Total number of outbound packets discarded by the port because the port is down or congested. | PortXmitDiscards | Error
port_xmit_constraint_errors | Total number of packets not transmitted from the switch physical port. | PortXmitConstraintErrors | Error
port_rcv_remote_physical_errors | Total number of packets marked with the EBP delimiter received on the port. | PortRcvRemotePhysicalErrors | Error
symbol_error | Total number of minor link errors detected on one or more physical lanes. | SymbolErrorCounter | Error
VL15_dropped | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) of the port. | VL15Dropped | Error
link_error_recovery | Total number of times the Port Training state machine has successfully completed the link error recovery process. | LinkErrorRecoveryCounter | Error
link_downed | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. | LinkDownedCounter | Error
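
A quick way to use the error counters described here is to print only those that are non-zero. This is a sketch rather than an official tool: `nonzero_error_counters` is a hypothetical helper, and the list of names is taken from the directory listing in this article, so it may need adjusting on other driver versions.

```shell
# Print "name value" for every error counter in the given directory whose value is non-zero.
# Counters that are missing or unreadable on this system are skipped.
nonzero_error_counters() {
    local dir=$1 name value
    for name in symbol_error link_downed link_error_recovery \
                local_link_integrity_errors port_rcv_errors \
                port_rcv_remote_physical_errors port_rcv_switch_relay_errors \
                port_rcv_constraint_errors port_xmit_constraint_errors \
                port_xmit_discards excessive_buffer_overrun_errors VL15_dropped; do
        [ -r "$dir/$name" ] || continue
        value=$(cat "$dir/$name")
        if [ "$value" != "0" ]; then
            echo "$name $value"
        fi
    done
    return 0
}
```

Example use: `nonzero_error_counters /sys/class/infiniband/mlx5_0/ports/1/counters`. Empty output means none of the listed error counters has fired.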

hw_counters

# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
total 0
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 duplicate_request
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 implied_nak_seq_err
0 -rw-r--r-- 1 root root 4.0K 5月  28 16:42 lifespan
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 local_ack_timeout_err
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 np_cnp_sent
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 np_ecn_marked_roce_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 out_of_buffer
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 out_of_sequence
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 packet_seq_err
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_cqe_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_remote_access_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_remote_invalid_request
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_cqe_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_local_length_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_remote_access_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rnr_nak_retry_err
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rp_cnp_handled
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rp_cnp_ignored
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_atomic_requests
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_icrc_encapsulated
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_read_requests
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_write_requests

HW Counters Description:

Counter | Description | Group
duplicate_request | Number of received packets. A duplicate request is a request that had been previously executed. | Error
implied_nak_seq_err | Number of times the requester detected an ACK with a PSN larger than the expected PSN for an RDMA read or response. | Error
lifespan | The maximum period in ms which defines the aging of the counter reads. Two consecutive reads within this period might return the same values. | Informative
local_ack_timeout_err | The number of times the QP's ack timer expired for RC, XRC, DCT QPs at the sender side. The QP retry limit was not exceeded, therefore it is still a recoverable error. | Error
np_cnp_sent | The number of CNP packets sent by the Notification Point when it noticed congestion experienced in the RoCEv2 IP header (ECN bits). The counter was added in MLNX_OFED 4.1. | Informative
np_ecn_marked_roce_packets | The number of RoCEv2 packets received by the Notification Point which were marked as having experienced congestion (ECN bits were '11' on the ingress RoCE traffic). The counter was added in MLNX_OFED 4.1. | Informative
out_of_buffer | The number of drops that occurred due to lack of WQEs for the associated QPs. | Error
out_of_sequence | The number of out-of-sequence packets received. | Error
packet_seq_err | The number of received NAK sequence error packets. The QP retry limit was not exceeded. | Error
req_cqe_error | The number of times the requester detected CQEs completed with errors. The counter was added in MLNX_OFED 4.1. | Error
req_cqe_flush_error | The number of times the requester detected CQEs completed with flushed errors. The counter was added in MLNX_OFED 4.1. | Error
req_remote_access_errors | The number of times the requester detected remote access errors. The counter was added in MLNX_OFED 4.1. | Error
req_remote_invalid_request | The number of times the requester detected remote invalid request errors. The counter was added in MLNX_OFED 4.1. | Error
resp_cqe_error | The number of times the responder detected CQEs completed with errors. The counter was added in MLNX_OFED 4.1. | Error
resp_cqe_flush_error | The number of times the responder detected CQEs completed with flushed errors. The counter was added in MLNX_OFED 4.1. | Error
resp_local_length_error | The number of times the responder detected local length errors. The counter was added in MLNX_OFED 4.1. | Error
resp_remote_access_errors | The number of times the responder detected remote access errors. The counter was added in MLNX_OFED 4.1. | Error
rnr_nak_retry_err | The number of received RNR NAK packets. The QP retry limit was not exceeded. | Error
rp_cnp_handled | The number of CNP packets handled by the Reaction Point HCA to throttle the transmission rate. The counter was added in MLNX_OFED 4.1. | Informative
rp_cnp_ignored | The number of CNP packets received and ignored by the Reaction Point HCA. This counter should not increase if RoCE Congestion Control is enabled in the network. If it does increase, verify that ECN is enabled on the adapter. See HowTo Configure DCQCN (RoCE CC) values for ConnectX-4 (Linux). The counter was added in MLNX_OFED 4.1. | Error
rx_atomic_requests | The number of received ATOMIC requests for the associated QPs. | Informative
rx_dct_connect | The number of received connection requests for the associated DCTs. | Informative
rx_read_requests | The number of received READ requests for the associated QPs. | Informative
rx_write_requests | The number of received WRITE requests for the associated QPs. | Informative
rx_icrc_encapsulated | The number of RoCE packets with ICRC errors. The counter was added in MLNX_OFED 4.4 and kernel 4.19. | Error
roce_adp_retrans | Counts the number of adaptive retransmissions for RoCE traffic. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_adp_retrans_to | Counts the number of times RoCE traffic reached timeout due to adaptive retransmission. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_slow_restart | Counts the number of times RoCE slow restart was used. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_slow_restart_cnps | Counts the number of times RoCE slow restart generated CNP packets. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_slow_restart_trans | Counts the number of times RoCE slow restart changed state to slow restart. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative

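  • duplicate_request: (Duplicated packets) number of received packets; a duplicate request is a request that has already been executed.
  • out_of_sequence: (Drop out of sequence) number of out-of-order packets received, which indicates that packet loss has already occurred.
  • packet_seq_err: (NAK sequence rcvd) number of NAK sequence error packets received; the QP retry limit was not exceeded.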
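
Since counters such as out_of_sequence only matter when they move, one option is to snapshot the hw_counters directory before and after a test run and print what changed. The sketch below assumes each counter is a plain file holding a single number; `snapshot_counters` and `changed_counters` are hypothetical helper names.

```shell
# Print "name value" for every readable counter file in a directory.
snapshot_counters() {
    local dir=$1 f
    for f in "$dir"/*; do
        if [ -f "$f" ] && [ -r "$f" ]; then
            printf '%s %s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
    return 0
}

# Compare two snapshot files and print "name before -> after" for counters that changed.
changed_counters() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && before[$1] != $2 { print $1, before[$1], "->", $2 }' "$1" "$2"
}
```

Example use: run `snapshot_counters /sys/class/infiniband/mlx5_0/ports/1/hw_counters > before.txt`, run the test, take a second snapshot into `after.txt`, then run `changed_counters before.txt after.txt`.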