RoCE QoS configuration - Priority mapping


After much consideration, I decided to write this in English for the sake of accuracy.

1. Priority configuration for RoCE

Both VPI verbs and RDMA_CM provide APIs for RoCE QoS configuration. With MLNX_OFED installed on the system, an application can use several methods to set the priority of RoCE traffic. This section describes them.

1.1. Trust L2 - PCP based QoS

PCP based QoS can be configured for RoCEv1 and RoCEv2 traffic.

1.1.1. With VPI verbs

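An application can assign a specific SL (Service Level, 4 bits, range 0-15) to a QP, for example:

struct ibv_qp_attr attr = {0};      /* allocate on the stack instead of using an uninitialized pointer */
attr.qp_state = IBV_QPS_RTR;
attr.ah_attr.sl = sl;               /* desired Service Level, 0-15 */
/* flags must include IBV_QP_STATE | IBV_QP_AV along with the other RTR attributes */
ibv_modify_qp(qp, &attr, flags);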

The device driver maps the SL to the 3-bit PCP field of the VLAN header. PCP carries the UP (User Priority) and takes the 3 least significant bits of the SL:

UP = SL & 0x7
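
The masking step can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the function name sl_to_up is made up for illustration and is not part of any driver or library API.

```python
def sl_to_up(sl):
    """Return the User Priority (PCP value) for a 4-bit Service Level."""
    if not 0 <= sl <= 15:
        raise ValueError("SL must be in the range 0-15")
    # PCP has only 3 bits, so the top bit of the SL is dropped.
    return sl & 0x7


# SL 3 and SL 11 differ only in the top bit, so they share one User Priority.
for sl in (3, 11):
    print(f"SL {sl} -> UP {sl_to_up(sl)}")
```

Because the top bit is discarded, each User Priority is reachable from exactly two Service Levels.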

The UP is then mapped to a HW Traffic Class (TC). The UP-to-TC mapping can be checked and modified with “mlnx_qos”, using the option “--prio_tc=LIST”.

The mapping from SL to TC in PCP based QoS mode:

| Service Level | User Priority | HW Traffic Class |
|---|---|---|
| 0, 8 | 0 | default 0, configurable with “mlnx_qos --prio_tc” |
| 1, 9 | 1 | default 1, configurable with “mlnx_qos --prio_tc” |
| 2, 10 | 2 | default 2, configurable with “mlnx_qos --prio_tc” |
| 3, 11 | 3 | default 3, configurable with “mlnx_qos --prio_tc” |
| 4, 12 | 4 | default 4, configurable with “mlnx_qos --prio_tc” |
| 5, 13 | 5 | default 5, configurable with “mlnx_qos --prio_tc” |
| 6, 14 | 6 | default 6, configurable with “mlnx_qos --prio_tc” |
| 7, 15 | 7 | default 7, configurable with “mlnx_qos --prio_tc” |

“ib_write_bw -S [SL value]” can be used to check whether RoCE traffic is transmitted via the expected priority.

1.1.2. With RDMA_CM

ToS can be set on RDMA_CM QPs through the rdma_set_option() API, with the option RDMA_OPTION_ID_TOS.
The Linux command “cma_roce_tos” can be used to set the default ToS.

In kernel, ToS {0, 8, 24, 16} would be mapped to SKB Priority {0, 2, 4, 6}, reference kernel code is in include/net/route.h::rt_tos2priority(u8 tos)
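
As a rough illustration, the lookup can be modeled in Python. The table below is my own transcription of the kernel's ip_tos2prio array together with the masking done in rt_tos2priority(), so treat it as an assumption to be checked against the kernel version in use; the Python names are hypothetical.

```python
# Assumed copy of the kernel's 16-entry ip_tos2prio table,
# written with the numeric TC_PRIO_* values.
IP_TOS2PRIO = [0, 0, 0, 0, 2, 2, 2, 2, 6, 6, 6, 6, 4, 4, 4, 4]


def tos_to_skb_priority(tos):
    """Model of rt_tos2priority(): only ToS bits 1-4 select the table entry."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS must be in the range 0-255")
    return IP_TOS2PRIO[(tos & 0x1E) >> 1]


for tos in (0, 8, 24, 16):
    print(f"ToS {tos} -> SKB priority {tos_to_skb_priority(tos)}")
```

The three precedence bits at the top of the ToS byte are masked out in this model, so the same four SKB priorities repeat every 32 ToS values.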

The device driver maps the SKB Priority to a User Priority. This mapping is user-defined and can be set from the Linux command line with “ip link set dev [vlan interface] type vlan egress-qos-map [sk_prio2egress_prio mapping]”.

The UP is then mapped to a HW Traffic Class, which the user can set with “mlnx_qos -i [interface] --prio_tc=LIST”.

The mapping from ToS to TC in PCP based QoS mode:

| Type of Service | SKB Priority | User Priority | HW Traffic Class |
|---|---|---|---|
| 0…7, 32…39, 64…71, 96…103, 128…135, 160…167, 192…199, 224…231 | 0 | Needs to be configured by the user with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc” |
| 8…15, 40…47, 72…79, 104…111, 136…143, 168…175, 200…207, 232…239 | 2 | Needs to be configured by the user with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc” |
| 16…23, 48…55, 80…87, 112…119, 144…151, 176…183, 208…215, 240…247 | 6 | Needs to be configured by the user with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc” |
| 24…31, 56…63, 88…95, 120…127, 152…159, 184…191, 216…223, 248…255 | 4 | Needs to be configured by the user with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc” |

“ib_write_bw -R --tos [ToS value]” can be used to check whether RoCE traffic is transmitted via expected priority.

1.2. Trust L3 - DSCP based QoS

DSCP based QoS can be configured for RoCEv2 traffic only; it does not apply to RoCEv1.

1.2.1. With VPI verbs

An application can create QPs with a specific TClass (Traffic Class, 8 bits, range 0-255).
The device driver maps the TClass to the 6-bit DSCP field in the IP header. DSCP takes the 6 most significant bits of the TClass:

DSCP = TClass >> 2
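
The byte split can be sketched in Python. The two low bits removed by the shift are the ECN field of the IP header, which is why they do not contribute to the DSCP; the helper name split_tclass is hypothetical.

```python
def split_tclass(tclass):
    """Split an 8-bit TClass into its (DSCP, ECN) parts."""
    if not 0 <= tclass <= 255:
        raise ValueError("TClass must be in the range 0-255")
    dscp = tclass >> 2   # upper 6 bits
    ecn = tclass & 0x3   # lower 2 bits
    return dscp, ecn


dscp, ecn = split_tclass(106)
print(f"TClass 106 -> DSCP {dscp}, ECN bits {ecn}")
```

Four consecutive TClass values therefore share one DSCP, for example 104 through 107 all give DSCP 26.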

The device driver maps the DSCP to a User Priority. The user can configure this mapping from the Linux command line with “mlnx_qos -i [interface] --dscp2prio=DSCP2PRIO”.

The UP is then mapped to a HW Traffic Class, which the user can configure with “mlnx_qos -i [interface] --prio_tc=LIST”.

The mapping from TClass to TC in DSCP based QoS mode:

| TClass | DSCP | User Priority | HW Traffic Class |
|---|---|---|---|
| 0…31 | 0…7 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 32…63 | 8…15 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 64…95 | 16…23 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 96…127 | 24…31 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 128…159 | 32…39 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 160…191 | 40…47 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 192…223 | 48…55 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 224…255 | 56…63 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |

“ib_write_bw --tclass [TClass value]” can be used to check whether RoCE traffic is transmitted via expected priority.

1.2.2. With RDMA_CM

In this mode, ToS (Type of Service) has the range 0-255, and the device driver treats it as the TClass value.
ToS can be set on RDMA_CM QPs through the rdma_set_option() API, with the option RDMA_OPTION_ID_TOS.
The Linux command “cma_roce_tos” can be used to set the default ToS.
The mappings from ToS (TClass) to DSCP, User Priority and HW Traffic Class are the same as described for the VPI verbs method in DSCP based QoS mode:

| ToS | DSCP | User Priority | HW Traffic Class |
|---|---|---|---|
| 0…31 | 0…7 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 32…63 | 8…15 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 64…95 | 16…23 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 96…127 | 24…31 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 128…159 | 32…39 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 160…191 | 40…47 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 192…223 | 48…55 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |
| 224…255 | 56…63 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc” |

“ib_write_bw -R --tos [ToS value]” can be used to check whether RoCE traffic is transmitted via expected priority.

1.3. Summary

The table below summarizes the priority configuration for RoCE traffic described above, including:

  • traffic type - RoCEv1 and RoCEv2
  • QoS type - L2 PCP and L3 DSCP
  • Configuration method - VPI verbs and RDMA_CM
  • Verification with ib_write_bw - with options to check if it is the expected configuration

| RoCE Traffic | QoS | Configuration method | Configuration value | Verification |
|---|---|---|---|---|
| RoCEv1 | L2 PCP | VPI verbs | SL [0…15] | ib_write_bw -S |
| RoCEv1 | L2 PCP | RDMA_CM | ToS [0, 8, 24, 16] | ib_write_bw -R --tos |
| RoCEv2 | L2 PCP | VPI verbs | SL [0…15] | ib_write_bw -S |
| RoCEv2 | L2 PCP | RDMA_CM | ToS [0, 8, 24, 16] | ib_write_bw -R --tos |
| RoCEv2 | L3 DSCP | VPI verbs | TClass [0…255] | ib_write_bw --tclass |
| RoCEv2 | L3 DSCP | RDMA_CM | ToS [0…255] | ib_write_bw -R --tos |

2. Configuration example

This section provides examples of the configuration described above.

2.1. RoCEv2 over Lossless Fabric (ECN+PFC) with Trust L2 QoS

2.1.1. Host configuration

This example shows the configuration on a host running Red Hat 7.4. A dual-port ConnectX-5 is used, with Ethernet port ens1f1 in the setup.

Enable the 8021q Linux kernel module on hosts (needed for traffic that passes through the kernel; not required for RoCE, which bypasses the kernel)

[root]# modprobe 8021q
[root]# lsmod | grep 8021q
8021q 33208 0
garp 14384 1 8021q
mrp 18542 1 8021q

Set VLAN on hosts

[root]# ip link add link ens1f1 name ens1f1.100 type vlan id 100
[root]# ip link show
23: ens1f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4200 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:ae:11:dd brd ff:ff:ff:ff:ff:ff
24: ens1f1.100@ens1f1: <BROADCAST,MULTICAST> mtu 4200 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:ae:11:dd brd ff:ff:ff:ff:ff:ff

Set IP for VLAN interface on hosts

[root]# ifconfig ens1f1.100 10.0.0.10

Configure PFC on hosts (enable PFC for priority#3)

[root]# mlnx_qos -i ens1f1 --pfc=0,0,0,1,0,0,0,0
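
The value passed to --pfc is a positional list with one 0/1 flag for each of the priorities 0 to 7. A small Python helper can build the string from a set of priorities; the function pfc_list is hypothetical and not part of mlnx_qos.

```python
def pfc_list(enabled_priorities):
    """Build the comma-separated 0/1 list for priorities 0-7."""
    enabled = set(enabled_priorities)
    if not enabled <= set(range(8)):
        raise ValueError("priorities must be in the range 0-7")
    return ",".join("1" if prio in enabled else "0" for prio in range(8))


# Enabling PFC on priority 3 only reproduces the argument used above.
print("--pfc=" + pfc_list([3]))
```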

Enable ECN on priority 3 (optional, ECN is enabled by default)

[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_np/enable/3
[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_rp/enable/3

Set the CNP L2 egress priority to 6 (optional, it is the default value)

[root]# echo 6 > /sys/class/net/ens1f1/ecn/roce_np/cnp_802p_prio

Enable ECN on the TCP traffic (optional, only when it is required for kernel TCP traffic)

[root]# sysctl -w net.ipv4.tcp_ecn=1
net.ipv4.tcp_ecn = 1

Set RoCE mode to V2 for RDMA CM traffic

[root]# cma_roce_mode -d mlx5_1 -p 1 -m 2
RoCE v2

Set the default ToS to 24, which is mapped to skprio 4:

[root]# cma_roce_tos -d mlx5_1 -t 24
24

Set the egress priority map (skprio#4 mapped to user priority#3)

[root]# ip link set dev ens1f1.100 type vlan egress-qos-map 4:3
[root]# cat /proc/net/vlan/ens1f1.100
ens1f1.100 VID: 100 REORDER_HDR: 1 dev->priv_flags: 1
total frames received 0
total bytes received 0
Broadcast/Multicast Rcvd 0
total frames transmitted 0
total bytes transmitted 0
Device: ens1f1
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings: 4:3

With this configuration, the default priority for RoCEv2 follows the mapping below. It takes effect on RDMA_CM QPs:

| ToS | SKB Priority | User Priority | HW Traffic Class |
|---|---|---|---|
| 24 | 4 | 3 | 3 |
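
The last two steps of this chain can be traced with a short Python sketch that parses an egress-qos-map string. Two assumptions here are mine rather than the article's: an skprio missing from the map falls back to UP 0, and the UP-to-TC mapping is left at the identity default listed in section 1.1.1. The function names are hypothetical.

```python
def parse_egress_qos_map(spec):
    """Parse 'skprio:up' pairs such as '4:3' or '0:0 4:3' into a dict."""
    mapping = {}
    for pair in spec.split():
        skprio, up = pair.split(":")
        mapping[int(skprio)] = int(up)
    return mapping


def skprio_to_up_and_tc(skprio, egress_map, prio_tc=None):
    """Return (User Priority, HW Traffic Class) for an SKB priority."""
    up = egress_map.get(skprio, 0)        # assumed fallback for unmapped skprio
    prio_tc = prio_tc or list(range(8))   # assumed identity UP -> TC default
    return up, prio_tc[up]


egress = parse_egress_qos_map("4:3")
print(skprio_to_up_and_tc(4, egress))
```

Under these assumptions, traffic whose ToS resolves to any skprio other than 4 would leave on User Priority 0, outside the PFC-enabled priority.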

2.1.2. Switch configuration

This example shows the configuration on a switch running MLNX ONYX; ports 1/3 and 1/4 are used in the setup.

Enable VLAN on switch ethernet interfaces

(config) # vlan 100
(config) # interface ethernet 1/3 switchport mode hybrid
(config) # interface ethernet 1/4 switchport mode hybrid
(config) # interface ethernet 1/3 switchport hybrid allowed-vlan 100
(config) # interface ethernet 1/4 switchport hybrid allowed-vlan 100

Set trust mode on switch ethernet interfaces

(config) # interface ethernet 1/3 qos trust L2
(config) # interface ethernet 1/4 qos trust L2

Create lossless buffer pool on switch

(config) # traffic pool roce_lossless type lossless
(config) # traffic pool roce_lossless memory percent 50.00

Map switch priority 3 to lossless buffer pool

(config) # traffic pool roce_lossless map switch-priority 3

2.1.3. Verification

To verify the settings, generate RoCEv2 traffic on hosts

[server] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R
[client] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R [IP address]

Observe priority-based counters on host:

  • RoCEv2 traffic is expected to be transmitted via priority#3
  • (if any) PFC is expected to be transmitted via priority#3
  • (if any) CNP is expected to be transmitted via priority#6

[root]# ethtool -S ens1f1 | grep prio

Observe priority-based counters on switch ports:

  • RoCEv2 traffic is expected to be transmitted via tc 3
  • (if any) CNP is expected to be transmitted via tc 6

(config) # show interfaces ethernet [switch port] counters tc all

Observe traffic pool-based counters on switch ports; RoCEv2 traffic is expected to be transmitted via pg 1

(config) # show interfaces ethernet [switch port] counters pg all

Observe PFC counters on switch ports; PFC frames (if any) are expected to be observed in PFC 3

(config) # show interface ethernet [switch port] counters pfc prio all

2.2. RoCEv2 over Lossless Fabric (ECN+PFC) with Trust L3 QoS

2.2.1. Host configuration

This example shows the configuration on a host running Red Hat 7.4. A dual-port ConnectX-5 is used, with Ethernet port ens1f1 in the setup.

Set trust mode to DSCP on hosts

[root]# mlnx_qos -i ens1f1 --trust=dscp

Set the default ToS to 106 (DSCP 26) for RoCE traffic on the port. It takes effect on QPs created with RDMA_CM when the application does not call rdma_set_option() with the option RDMA_OPTION_ID_TOS:

[root]# cma_roce_tos -d [mlx-device] -t 106

Set the global TClass to 106 (DSCP 26) for all RoCE traffic on the port. Note that this is a globally forced value: it is applied to all QPs and takes precedence over both the cma_roce_tos setting and any value specified by the user application:

[root]# echo 106 > /sys/class/infiniband/[mlx-device]/tc/1/traffic_class

Rules can be added so that the forced value applies only to matching QPs. In this example, the TClass is set on loopback traffic with IP address 11.7.156.133:

[root]# echo "src_ip=11.7.156.133/32,dst_ip=11.7.156.133/32,tclass=106" > /sys/class/infiniband/mlx5_0/tc/1/traffic_class

Configure PFC on hosts (enable PFC for priority#3)

[root]# mlnx_qos -i ens1f1 --pfc=0,0,0,1,0,0,0,0

Enable ECN on priority 3 (optional, ECN is enabled by default)

[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_np/enable/3
[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_rp/enable/3

Set the CNP DSCP to 48, which corresponds to L3 egress priority 6 (optional, it is the default value)

[root]# echo 48 > /sys/class/net/ens1f1/ecn/roce_np/cnp_dscp

Enable ECN on the TCP traffic (optional, only when it is required for kernel TCP traffic)

[root]# sysctl -w net.ipv4.tcp_ecn=1
net.ipv4.tcp_ecn = 1

Set RoCE mode to V2 for RDMA CM traffic

[root]# cma_roce_mode -d mlx5_1 -p 1 -m 2
RoCE v2

With this configuration, the default priority for RoCEv2 follows the mapping below. It takes effect on QPs created with VPI verbs and RDMA_CM:

| TClass | ToS | DSCP | User Priority | HW Traffic Class |
|---|---|---|---|---|
| 106 | 106 | 26 | 3 | 3 |
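
The row above can be reproduced with a small Python sketch. It assumes that the default dscp2prio mapping places eight consecutive DSCP values on each priority (prio = DSCP >> 3) and that the UP-to-TC mapping is the identity. Both defaults are consistent with the tables in this article, but they are assumptions here and should be confirmed against the mlnx_qos output on the host. The function name is hypothetical.

```python
def tclass_to_tc(tclass, dscp2prio=None, prio_tc=None):
    """Return (DSCP, User Priority, HW Traffic Class) for a TClass/ToS byte."""
    if not 0 <= tclass <= 255:
        raise ValueError("TClass must be in the range 0-255")
    dscp = tclass >> 2
    if dscp2prio is None:
        dscp2prio = {d: d >> 3 for d in range(64)}  # assumed default mapping
    if prio_tc is None:
        prio_tc = list(range(8))                    # assumed identity default
    up = dscp2prio[dscp]
    return dscp, up, prio_tc[up]


print(tclass_to_tc(106))
```

Passing a custom dscp2prio dictionary or prio_tc list lets the same function mirror a non-default mlnx_qos setup.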

2.2.2. Switch configuration

This example shows the configuration on a switch running MLNX ONYX; ports 1/3 and 1/4 are used in the setup.

Set trust mode on switch ethernet interfaces

(config) # interface ethernet 1/3 qos trust L3
(config) # interface ethernet 1/4 qos trust L3

Create lossless buffer pool on switch

(config) # traffic pool roce_lossless type lossless
(config) # traffic pool roce_lossless memory percent 50.00

Map switch priority 3 to lossless buffer pool

(config) # traffic pool roce_lossless map switch-priority 3

2.2.3. Verification

To verify the settings, generate RoCEv2 traffic on hosts with RDMA_CM QPs

[server] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R
[client] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R [IP address]

Or generate RoCEv2 traffic on hosts with QPs created by VPI verbs

[server] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely
[client] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely [IP address]

Observe priority-based counters on host:

  • RoCEv2 traffic is expected to be transmitted via priority 3
  • (if any) PFC is expected to be transmitted via priority 3
  • (if any) CNP is expected to be transmitted via priority 6

[root] # ethtool -S ens1f1 | grep prio

Observe priority-based counters on switch ports:

  • RoCEv2 traffic is expected to be transmitted via tc 3
  • (if any) CNP is expected to be transmitted via tc 6

(config) # show interfaces ethernet [switch port] counters tc all

Observe traffic pool-based counters on switch ports; RoCEv2 traffic is expected to be transmitted via pg 1

(config) # show interfaces ethernet [switch port] counters pg all

Observe PFC counters on switch ports; PFC frames (if any) are expected to be observed in PFC 3

(config) # show interface ethernet [switch port] counters pfc prio all
