HOW TO CONFIGURE ROCE OVER LAG (MELLANOX CONNECTX)

DESCRIPTION

*Please note: this networking configuration is not suitable for K8s and NCCL-based AI applications.


*For the recommended RoCE configuration and verification, please click here.

This post discusses the usage and testing of RoCE over a Link Aggregation (LAG) interface for ConnectX-4 adapters.

BACKGROUND

RoCE LAG is a feature that mimics Ethernet bonding for IB devices. It is available on dual-port cards only.

RoCE LAG mode is entered when both Ethernet interfaces of the same card are the only slaves of the same bond interface, and the bonding mode is one of the following:

  • active-backup (mode 1)
  • balance-xor (mode 2)
  • 802.3ad (LACP) (mode 4)

If a change to the bond configuration violates either of these rules (that is, the bonding mode is not 1, 2 or 4, or the two Ethernet interfaces of the card are no longer the only slaves of the bond interface), the card exits RoCE LAG mode and returns to the normal configuration of one InfiniBand (IB) device per port.
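The two rules above can be written down as a small check. This is only a sketch of the rules as stated in this post, not driver code; the function name roce_lag_expected and its parameters are hypothetical and chosen for illustration.

```python
# Bonding modes that allow RoCE LAG, as listed above (number -> name).
ROCE_LAG_MODES = {1: "active-backup", 2: "balance-xor", 4: "802.3ad"}


def roce_lag_expected(bond_mode, bond_slaves, card_ports):
    """Return True if both rules described above hold.

    bond_mode   -- numeric bonding mode of the bond interface
    bond_slaves -- names of all interfaces enslaved to the bond
    card_ports  -- names of the two Ethernet interfaces of one dual-port card
    """
    mode_ok = bond_mode in ROCE_LAG_MODES
    # Both ports of the card must be in the bond, and nothing else may be.
    slaves_ok = len(set(card_ports)) == 2 and set(bond_slaves) == set(card_ports)
    return mode_ok and slaves_ok


if __name__ == "__main__":
    ports = ["ens817", "ens817d1"]
    print(roce_lag_expected(1, ["ens817", "ens817d1"], ports))  # both rules hold
    print(roce_lag_expected(0, ["ens817", "ens817d1"], ports))  # mode 0 is not listed
    print(roce_lag_expected(4, ["ens817"], ports))              # only one port enslaved
```

The interface names are the ones used in the configuration example of this post.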

When in RoCE LAG mode, instead of having an IB device per physical port (for example mlx5_0 and mlx5_1), only one IB device will be present for both ports with 'bond' appended to its name (for example mlx5_bond_0). This device provides an aggregation of both IB ports, just as the bond interface provides an aggregation of both Ethernet interfaces.
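One way to see which state the card is in is to look at the IB device names the kernel has registered under /sys/class/infiniband. The sketch below assumes that a RoCE LAG device can be recognized by the substring "bond" in its name, as in the mlx5_bond_0 example above; the function names are hypothetical.

```python
import os


def split_ib_devices(names):
    """Split IB device names into (bond_devices, per_port_devices)."""
    # Assumption: a RoCE LAG device has "bond" in its name (e.g. mlx5_bond_0).
    bond = sorted(n for n in names if "bond" in n)
    per_port = sorted(n for n in names if "bond" not in n)
    return bond, per_port


def list_ib_devices(sysfs_dir="/sys/class/infiniband"):
    """Return the IB device names found in sysfs, or [] if the directory is missing."""
    try:
        return os.listdir(sysfs_dir)
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    bond, per_port = split_ib_devices(list_ib_devices())
    print("RoCE LAG devices:", bond)
    print("Per-port devices:", per_port)
```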

The method used by the IB bond device to distribute traffic depends on the Ethernet bond interface's mode, as follows:

  • active-backup (mode 1): QPs are always assigned to the physical port of the active slave (chosen by the bond interface). When the active slave changes, all QPs are moved to the new active slave. For the bonding driver to act on link-down events and make the other interface the active slave, it must be configured with the "miimon" option (for example, "miimon=100").
  • balance-xor (mode 2) and 802.3ad (LACP) (mode 4): QPs are assigned to physical ports in a round-robin manner. Each time a QP is moved from the RESET state to the INIT state, it is assigned to one of the two physical ports.
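The difference between the two behaviors can be illustrated with a toy model. This is a simplified sketch of the description above and not how the driver is implemented; the class ToyRoceLag and its methods are hypothetical, and the exact order in which the real driver picks ports is an assumption.

```python
class ToyRoceLag:
    """Toy model of QP-to-port assignment in RoCE LAG mode (not driver code)."""

    def __init__(self, mode, ports, active=None):
        self.mode = mode              # "active-backup", "balance-xor" or "802.3ad"
        self.ports = list(ports)      # the two physical ports of the card
        self.active = active or self.ports[0]
        self.qp_port = {}             # QP number -> physical port
        self._next = 0                # round-robin cursor

    def qp_reset_to_init(self, qpn):
        """Assign a port when a QP moves from RESET to INIT."""
        if self.mode == "active-backup":
            port = self.active
        else:
            # Modes 2 and 4: alternate between the physical ports.
            port = self.ports[self._next % len(self.ports)]
            self._next += 1
        self.qp_port[qpn] = port
        return port

    def failover(self, new_active):
        """The bond picked a new active slave; in active-backup all QPs follow it."""
        self.active = new_active
        if self.mode == "active-backup":
            for qpn in self.qp_port:
                self.qp_port[qpn] = new_active


if __name__ == "__main__":
    lag = ToyRoceLag("active-backup", ["ens817", "ens817d1"])
    for qpn in range(4):
        lag.qp_reset_to_init(qpn)
    lag.failover("ens817d1")
    print(lag.qp_port)
```

In the active-backup case every QP ends up on the new active port after the failover, while in the other two modes the QPs stay spread over both ports.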

Note: The connected switch (or server) should be configured in the same way.

SETUP

  • Use two servers, each equipped with a dual-port Mellanox ConnectX-4 adapter card, connected through a switch.
  • Configure the ports in Ethernet mode.
  • Install MLNX_OFED Rel. 4.0 or later on both servers.
  • Use kernel 4.9 or later. RHEL 7.4 and newer RHEL 7.x releases also include the support.

CONFIGURATION EXAMPLE

1. Make sure bonding is enabled on the server. Follow the manuals of your OS distribution to create a bond0 interface (for example, the RHEL bonding documentation).

2. Configure the bond0 interface on both servers by editing /etc/sysconfig/network-scripts/ifcfg-bond0 as follows:

DEVICE=bond0
NAME=bond0
TYPE=bond
BONDING_MASTER=yes
# The other server should have a different IP on the same subnet
IPADDR=22.22.22.6
PREFIX=24
BOOTPROTO=none
ONBOOT=yes
NM_CONTROLLED=no
BONDING_OPTS="mode=active-backup miimon=100 updelay=100 downdelay=100"

In this example, the bond mode is active-backup.

3. Configure the first physical port of the Mellanox ConnectX-4 adapter by editing /etc/sysconfig/network-scripts/ifcfg-ens817:

DEVICE=ens817
TYPE=Ethernet
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

4. Configure the other physical port of the Mellanox ConnectX-4 adapter by editing /etc/sysconfig/network-scripts/ifcfg-ens817d1:

DEVICE=ens817d1
TYPE=Ethernet
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
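Since the three files in steps 2 to 4 differ only in a few values, they can also be generated. The following is a minimal sketch that renders the same KEY=value content as text; the helper names (render_ifcfg, bond_config, slave_config) are hypothetical, and the script only prints the result rather than writing under /etc/sysconfig/network-scripts.

```python
def render_ifcfg(settings):
    """Render a dict of KEY -> value as ifcfg-style KEY=value lines."""
    lines = []
    for key, value in settings.items():
        # Quote values that contain spaces, as BONDING_OPTS does above.
        text = f'"{value}"' if " " in str(value) else str(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


def bond_config(name, ipaddr, prefix, mode="active-backup", miimon=100):
    """Settings for the bond interface, mirroring step 2."""
    opts = f"mode={mode} miimon={miimon} updelay=100 downdelay=100"
    return {
        "DEVICE": name, "NAME": name, "TYPE": "bond", "BONDING_MASTER": "yes",
        "IPADDR": ipaddr, "PREFIX": prefix, "BOOTPROTO": "none",
        "ONBOOT": "yes", "NM_CONTROLLED": "no", "BONDING_OPTS": opts,
    }


def slave_config(device, master):
    """Settings for one physical port, mirroring steps 3 and 4."""
    return {"DEVICE": device, "TYPE": "Ethernet", "ONBOOT": "yes",
            "MASTER": master, "SLAVE": "yes", "BOOTPROTO": "none"}


if __name__ == "__main__":
    print(render_ifcfg(bond_config("bond0", "22.22.22.6", 24)))
    for port in ("ens817", "ens817d1"):
        print(render_ifcfg(slave_config(port, "bond0")))
```

Passing mode="802.3ad" to bond_config gives the BONDING_OPTS line for an LACP bond with the same timers.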

5. Restart the port (or the driver):

# ifdown bond0
# ifup bond0

or

# /etc/init.d/network restart
# /etc/init.d/openibd restart

6. Make sure that you can ping between the two servers using the bond0 interface.

7. Check the current status of bond0:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens817
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100

Slave Interface: ens817
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 6
Permanent HW addr: e4:1d:2d:26:3c:e1
Slave queue ID: 0

Slave Interface: ens817d1
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 6
Permanent HW addr: e4:1d:2d:26:3c:e2
Slave queue ID: 0

If you use bond mode 802.3ad, the output looks like the following example:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 1
Partner Key: 1
Partner Mac Address: e4:1d:2d:26:33:31

Slave Interface: ens817
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: e4:1d:2d:26:3c:e1
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: ens817d1
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: e4:1d:2d:26:3c:e2
Aggregator ID: 1
Slave queue ID: 0
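When the status has to be checked repeatedly (for example during the failover test), reading the file by eye gets tedious. The sketch below parses the text format shown in step 7 into a small dictionary. It assumes the "Key: value" layout of the outputs above; the function name parse_bond_status and the shortened SAMPLE text are for illustration only.

```python
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens817
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: ens817
MII Status: up
Speed: 40000 Mbps
Link Failure Count: 6
Permanent HW addr: e4:1d:2d:26:3c:e1

Slave Interface: ens817d1
MII Status: up
Speed: 40000 Mbps
Link Failure Count: 6
Permanent HW addr: e4:1d:2d:26:3c:e2
"""


def parse_bond_status(text):
    """Parse /proc/net/bonding/<bond> text into a small summary dict."""
    summary = {"mode": None, "active_slave": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Currently Active Slave":
            summary["active_slave"] = value
        elif key == "Slave Interface":
            current = value
            summary["slaves"][current] = {}
        elif current is not None and key in ("MII Status", "Speed", "Link Failure Count"):
            # Per-slave fields; bond-level fields before the first slave are skipped.
            summary["slaves"][current][key] = value
    return summary


if __name__ == "__main__":
    status = parse_bond_status(SAMPLE)
    print(status["mode"], "| active:", status["active_slave"])
    for name, info in status["slaves"].items():
        print(name, info)
```

On a live system, the text can be read from /proc/net/bonding/bond0 and passed to parse_bond_status in place of SAMPLE.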

8. Run one of the perftest tools to check RDMA, for example ib_send_bw.

Run the ib_send_bw server on one host:

# ib_send_bw -D60 -f --report_gbits

and the ib_send_bw client on the other host:

# ib_send_bw 22.22.22.6 -D60 -f --report_gbits

TESTING PROCEDURE

1. For active-backup mode, a simple test is enough: run RoCE traffic over the bond0 interface and, while the traffic is running, bring the ports down and up. The connection should stay open and traffic should keep running as it is toggled between the ports.

2. Testing the active-active bonding modes (balance-xor and 802.3ad) is trickier, because you need to create multiple RoCE flows so that different flows are load-balanced to different ports.

IMPORTANT SWITCH NOTES

Most switches are configured for MAC-address-based load balancing. It is highly recommended to switch to load balancing based on IP address and UDP port. Example for Mellanox Onyx switches:

sw01 [standalone: master] (config) # port-channel load-balance ethernet source-destination-ip source-destination-port
sw01 [standalone: master] (config) # show interfaces port-channel load-balance
source-destination-ip, source-destination-port
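The reason for this recommendation can be seen with a toy hash. All RoCE flows between two servers share the same pair of MAC addresses, so a MAC-based hash maps them all to one link of the port channel, while a hash that also covers the UDP ports can spread them. The sketch below is an illustration only and not the hash any real switch uses; the XOR-based functions, the second server's IP address and the UDP source ports are assumptions (4791 is the RoCEv2 UDP destination port).

```python
def mac_hash(src_mac, dst_mac, n_links=2):
    """Toy MAC-based hash: XOR the last byte of each MAC address."""
    last = lambda mac: int(mac.split(":")[-1], 16)
    return (last(src_mac) ^ last(dst_mac)) % n_links


def ip_port_hash(src_ip, dst_ip, src_port, dst_port, n_links=2):
    """Toy IP + UDP-port hash: XOR the last IP octets and both ports."""
    last = lambda ip: int(ip.split(".")[-1])
    return (last(src_ip) ^ last(dst_ip) ^ src_port ^ dst_port) % n_links


SRC_MAC, DST_MAC = "e4:1d:2d:26:3c:e1", "e4:1d:2d:26:33:31"
SRC_IP, DST_IP = "22.22.22.5", "22.22.22.6"   # 22.22.22.5 is a hypothetical peer
ROCE_V2_PORT = 4791                            # RoCEv2 UDP destination port
SRC_PORTS = [49152, 49153, 49154, 49155]       # hypothetical per-flow source ports

if __name__ == "__main__":
    for sport in SRC_PORTS:
        print("flow with source port", sport,
              "| MAC hash -> link", mac_hash(SRC_MAC, DST_MAC),
              "| IP+port hash -> link",
              ip_port_hash(SRC_IP, DST_IP, sport, ROCE_V2_PORT))
```

With these inputs, the MAC-based hash sends every flow to the same link, and the IP and port hash alternates between the two links.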

ARTICLE NUMBER

000004641
