How exactly does xmit_hash_policy layer3+4 hash in Linux bonding mode 4?

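While using iperf3 to test an LACP link aggregation built from 4 NICs, with xmit_hash_policy set to layer3+4, I could never saturate the bandwidth when iperf3 was given a small number of threads.

For example, with the source IP, destination IP and destination port held constant, four threads using the four consecutive source ports 10001~10004 (iperf3 arguments -P 4 -B src_ip --cport 10001) in a TCP test only achieved the throughput of 3 NICs. The counters in /proc/net/dev and a tcpdump capture both showed that only NICs 1, 2 and 3 carried outbound traffic.

Port 10001 used NIC 1, ports 10002 and 10003 used NIC 2, port 10004 used NIC 3, and NIC 4 sent nothing.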
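The per-NIC check against /proc/net/dev can be scripted. The following is a minimal sketch, not the procedure used in the original test: the interface names eth0 to eth3 and all counter values are made up for illustration, and the helper names parse_tx_counters and tx_delta are hypothetical. It relies only on the standard /proc/net/dev layout, in which each interface line has 8 receive columns followed by the transmit columns.

```python
def parse_tx_counters(proc_net_dev_text):
    """Return {interface: (tx_bytes, tx_packets)} from /proc/net/dev content."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 10:
            continue
        # Eight receive columns come first; transmit bytes and packets follow.
        counters[name.strip()] = (int(fields[8]), int(fields[9]))
    return counters


def tx_delta(before, after):
    """Transmitted bytes per interface between two snapshots."""
    return {name: after[name][0] - before[name][0]
            for name in after if name in before}


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|"
    "bytes packets errs drop fifo colls carrier compressed\n"
)
# Made-up snapshots; on a real host use open("/proc/net/dev").read() instead.
SAMPLE_BEFORE = HEADER + "".join(
    f"  eth{i}: 100 1 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n" for i in range(4)
)
SAMPLE_AFTER = HEADER + (
    "  eth0: 100 1 0 0 0 0 0 0 9000 90 0 0 0 0 0 0\n"
    "  eth1: 100 1 0 0 0 0 0 0 17000 170 0 0 0 0 0 0\n"
    "  eth2: 100 1 0 0 0 0 0 0 9000 90 0 0 0 0 0 0\n"
    "  eth3: 100 1 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n"
)

delta = tx_delta(parse_tx_counters(SAMPLE_BEFORE), parse_tx_counters(SAMPLE_AFTER))
idle = sorted(name for name, sent in delta.items() if sent == 0)
print(delta)
print("idle slaves:", idle)
```

Taking one snapshot before the iperf3 run and one after, then looking for slaves whose delta is zero, gives the same picture as reading the counters by hand.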

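Later tests with other runs of consecutive ports showed similar behavior, and sometimes only two NICs' worth of throughput.

So I consulted the kernel bonding documentation, which describes the hash as follows: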

layer3+4

    This policy uses upper layer protocol information,
    when available, to generate the hash. This allows for
    traffic to a particular network peer to span multiple
    slaves, although a single connection will not span
    multiple slaves.

    The formula for unfragmented TCP and UDP packets is

        hash = source port, destination port (as in the header)
        hash = hash XOR source IP XOR destination IP
        hash = hash XOR (hash RSHIFT 16)
        hash = hash XOR (hash RSHIFT 8)
        And then hash is reduced modulo slave count.

    If the protocol is IPv6 then the source and destination
    addresses are first hashed using ipv6_addr_hash.

    For fragmented TCP or UDP packets and all other IPv4 and
    IPv6 protocol traffic, the source and destination port
    information is omitted. For non-IP traffic, the
    formula is the same as for the layer2 transmit hash
    policy.

    This algorithm is not fully 802.3ad compliant. A
    single TCP or UDP conversation containing both
    fragmented and unfragmented packets will see packets
    striped across two interfaces. This may result in out
    of order delivery. Most traffic types will not meet
    this criteria, as TCP rarely fragments traffic, and
    most UDP traffic is not involved in extended
    conversations. Other implementations of 802.3ad may
    or may not tolerate this noncompliance.

But according to this algorithm, the four ports 10001-10004 should be spread across all four NICs.

    def xmit_hash(slaves, sport, dport, sip, dip):
        # Formula as documented; ports and IPs are passed as plain integers.
        h = (sport << 16) | dport   # source port, destination port (as in the header)
        h ^= sip ^ dip
        h ^= h >> 16
        h ^= h >> 8
        return h % slaves           # reduced modulo slave count
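To check that claim, the documented formula can be evaluated for the four source ports. This is a minimal sketch under stated assumptions: the addresses 192.168.1.10 and 192.168.1.20 and the destination port 5201 are placeholders rather than the values from the original test, the function name doc_xmit_hash is hypothetical, and ports and addresses are treated as plain integers, ignoring the byte-order details of the real kernel code.

```python
import ipaddress


def doc_xmit_hash(slaves, sport, dport, sip, dip):
    """layer3+4 hash as written in the bonding documentation (no final shift)."""
    h = (sport << 16) | dport
    h ^= sip ^ dip
    h ^= h >> 16
    h ^= h >> 8
    return h % slaves


# Placeholder endpoints, assumed for illustration only.
SRC_IP = int(ipaddress.IPv4Address("192.168.1.10"))
DST_IP = int(ipaddress.IPv4Address("192.168.1.20"))
DPORT = 5201

doc_result = {sport: doc_xmit_hash(4, sport, DPORT, SRC_IP, DST_IP)
              for sport in range(10001, 10005)}
print(doc_result)
print("distinct slaves:", len(set(doc_result.values())))
```

Because the fixed addresses and destination port only XOR a constant into the low bits, the four consecutive source ports end up on four different slaves in this model regardless of which endpoints are chosen.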

That did not explain the result, so I went through the source in drivers/net/bonding/bond_main.c and found that the code does not match the documentation:

    hash ^= (__force u32)flow_get_u32_dst(&flow) ^
            (__force u32)flow_get_u32_src(&flow);
    hash ^= (hash >> 16);
    hash ^= (hash >> 8);

    return hash >> 1;

Translated into Python, this is roughly:

    def xmit_hash(slaves, sport, dport, sip, dip):
        # Same as the documented formula, except for the final shift.
        h = (sport << 16) | dport
        h ^= sip ^ dip
        h ^= h >> 16
        h ^= h >> 8
        return (h >> 1) % slaves    # lowest hash bit is discarded first

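This hash >> 1 makes all the difference. With it, in a 4-NIC bond with the source and destination addresses and the destination port held constant, source ports (2n, 2n+1) are assigned to the same NIC. In the example above, 10001 gets one NIC, 10002 and 10003 share one NIC, and 10004 gets one NIC. Three NICs are used and one sits idle, which matches the test results.

So why was it done this way? Fortunately the git log has the answer: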
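The pairing can be reproduced with a small script. This is a sketch of the simplified integer model of the kernel code, not the kernel code itself; the endpoint addresses and destination port are placeholders, and the names kernel_xmit_hash and group_by_slave are hypothetical.

```python
import ipaddress
from collections import defaultdict


def kernel_xmit_hash(slaves, sport, dport, sip, dip):
    """Simplified model of the layer3+4 hash with the final 'hash >> 1'."""
    h = (sport << 16) | dport
    h ^= sip ^ dip
    h ^= h >> 16
    h ^= h >> 8
    return (h >> 1) % slaves


# Placeholder endpoints, assumed for illustration only.
SRC_IP = int(ipaddress.IPv4Address("192.168.1.10"))
DST_IP = int(ipaddress.IPv4Address("192.168.1.20"))
DPORT = 5201


def group_by_slave(sports, slaves=4):
    """Map each slave index to the source ports that hash onto it."""
    groups = defaultdict(list)
    for sport in sports:
        groups[kernel_xmit_hash(slaves, sport, DPORT, SRC_IP, DST_IP)].append(sport)
    return dict(groups)


groups = group_by_slave(range(10001, 10005))
for slave, ports in sorted(groups.items()):
    print(f"slave {slave}: source ports {ports}")
```

Ports 2n and 2n+1 differ only in their lowest bit, which propagates to the lowest bit of the hash and is then shifted away, so they always share a slave in this model.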

    # git log -p b5f862180d7011d9575d0499fa37f0f25b423b12
    Author: Hangbin Liu
    Date:   Mon Nov 6 09:01:57 2017 +0800

        bonding: discard lowest hash bit for 802.3ad layer3+4

        After commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
        in connect()"), we will try to use even ports for connect(). Then if an
        application (seen clearly with iperf) opens multiple streams to the same
        destination IP and port, each stream will be given an even source port.

        So the bonding driver's simple xmit_hash_policy based on layer3+4 addressing
        will always hash all these streams to the same interface. And the total
        throughput will limited to a single slave.

        Change the tcp code will impact the whole tcp behavior, only for bonding
        usage. Paolo Abeni suggested fix this by changing the bonding code only,
        which should be more reasonable, and less impact.

        Fix this by discarding the lowest hash bit because it contains little entropy.
        After the fix we can re-balance between slaves.

        Signed-off-by: Paolo Abeni
        Signed-off-by: Hangbin Liu
        Signed-off-by: David S. Miller

    diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
    index c99dc59d729b..76e8054bfc4e 100644
    --- a/drivers/net/bonding/bond_main.c
    +++ b/drivers/net/bonding/bond_main.c
    @@ -3253,7 +3253,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
            hash ^= (hash >> 16);
            hash ^= (hash >> 8);

    -       return hash;
    +       return hash >> 1;
     }

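Judging by the name, the author appears to be a fellow Chinese developer. The gist is that another commit from 2015, 07f4c90062f8, changed how ephemeral source ports are selected: with tools such as iperf3, the automatically assigned source port now increases by 2 each time instead of by 1. With the original xmit_hash_policy this causes problems; with 4 NICs, for example, traffic can only ever hash onto two of them. So he discarded the lowest bit of the hash to cope with this change.

This commit was submitted in 2017 and has not been added to the 3.10.0-514 kernel of CentOS/RHEL 7.3.

I had run into the source port increasing by 2 before while testing with Python, but never dug into it. Now I have the answer to that as well, which is a nice coincidence.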
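The effect of even-only source ports can be compared before and after the fix with a small script. This is a minimal sketch using the simplified integer model of the hash; the endpoints, the destination port and the even source ports 10000 to 10006 are assumptions chosen for illustration, and the drop_low_bit parameter is a hypothetical switch between the pre-fix and post-fix behavior.

```python
import ipaddress


def xmit_hash(slaves, sport, dport, sip, dip, drop_low_bit):
    """layer3+4 hash model; drop_low_bit=True mimics 'return hash >> 1'."""
    h = (sport << 16) | dport
    h ^= sip ^ dip
    h ^= h >> 16
    h ^= h >> 8
    if drop_low_bit:
        h >>= 1
    return h % slaves


# Placeholder endpoints, assumed for illustration only.
SRC_IP = int(ipaddress.IPv4Address("192.168.1.10"))
DST_IP = int(ipaddress.IPv4Address("192.168.1.20"))
DPORT = 5201
EVEN_PORTS = [10000, 10002, 10004, 10006]  # source ports stepping by 2


def slaves_used(slaves, sports, drop_low_bit):
    """Number of distinct slaves hit by the given source ports."""
    return len({xmit_hash(slaves, sport, DPORT, SRC_IP, DST_IP, drop_low_bit)
                for sport in sports})


for slaves in (2, 4):
    before = slaves_used(slaves, EVEN_PORTS, drop_low_bit=False)
    after = slaves_used(slaves, EVEN_PORTS, drop_low_bit=True)
    print(f"{slaves} slaves: {before} used before the fix, {after} after")
```

In this model, even-only ports reach a single slave of a 2-slave bond and two slaves of a 4-slave bond without the shift, while the shifted hash reaches all of them. The cost is the pairing of consecutive ports seen in the explicit --cport test.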
