Troubleshooting InfiniBand High-Speed Networks in HPC and Parallel Computing Clusters

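Contents

Background

Troubleshooting steps:

1. Check the InfiniBand state of every node in the cluster

2. Check the port rate on each node

3. Check that the OFED version is the same on every node

4. Check that the HCA firmware version is the same across the cluster

5. View the topology of the whole IB network

6. Check IB network ports for errors

7. Locate the physical position of the problem port or cable

Causes of IB port errors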


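Background

As server processing speeds keep increasing, the demand for faster interconnects has become more and more urgent. Traditional I/O technologies such as PCI and Ethernet can no longer meet this demand, so raising data transfer speed and usable bandwidth has become a major problem that must be solved. The InfiniBand standard emerged in response. It largely removes the transfer bottleneck of traditional I/O architectures and runs at up to 56 Gb/s per port. The technology is now widely used in high-performance computing, and the Linux kernel supports it fully. In our project deployments, however, IB network problems have multiplied as the fabrics have grown, and they seriously reduce the efficiency of programs running on the cluster.

This article describes how to use diagnostic tools to track down IB network errors and gives the corresponding fixes, so that users get an efficient, usable InfiniBand network.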

Troubleshooting steps:

1. Check the InfiniBand state of every node in the cluster

You can use nprsh to check all nodes in parallel, as shown below:

root@cluster ~ # nprsh  -on c2..12  "ibstat |grep Sta"

[c7]            State: Active

[c6]            State: Active

[c3]            State: Active

[c10]           State: Active

[c8]            State: Active

[c5]            State: Active

[c11]           State: Active

[c2]            State: Active

[c12]           State: Active

[c4]            State: Active

[c9]            State: Active

If State is Down, the node's InfiniBand link is not up. This is usually because the openibd service has not been started or the system cannot detect the HCA.
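On a large cluster it is easier to filter the output than to read it line by line. The following is a minimal sketch, assuming the nprsh output has been captured as text in the `[host] State: ...` form shown above. The function name `nodes_not_active` and the sample data (including the Down node) are made up for illustration and are not part of any InfiniBand tool.

```python
import re

# Matches lines such as "[c7]            State: Active"
STATE_LINE = re.compile(r"^\[(?P<host>[^\]]+)\]\s*State:\s*(?P<state>\S+)")


def nodes_not_active(output: str) -> list:
    """Return the hosts whose reported port state is anything other than Active."""
    bad = []
    for line in output.splitlines():
        match = STATE_LINE.match(line.strip())
        if match and match.group("state").lower() != "active":
            bad.append(match.group("host"))
    return sorted(bad)


# Hypothetical captured output; c6 is deliberately shown as Down.
sample = """\
[c7]            State: Active
[c6]            State: Down
[c3]            State: Active
"""
print(nodes_not_active(sample))
```

An empty list means every node that answered reported an Active port. Nodes that did not answer at all do not appear in the output, so they need to be checked separately.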

2. Check the port rate on each node

Use nprsh to check whether the port rate is the same on every node, as shown below:

root@cluster ~ # nprsh  -on c2..12 "ibstat |grep Rate "

[c2]            Rate: 40

[c5]            Rate: 40

[c9]            Rate: 40

[c3]            Rate: 40

[c12]           Rate: 40

[c10]           Rate: 40

[c6]            Rate: 40

[c11]           Rate: 40

[c8]            Rate: 40

[c7]            Rate: 40

[c4]            Rate: 40

If the rates differ, the cause is usually a cable problem.
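The comparison can be scripted as well. This is a rough sketch that assumes the expected rate is 40, as in the output above, and that the nprsh output is available as text. The helper `nodes_with_unexpected_rate` and the degraded value in the sample are hypothetical.

```python
import re

# Matches lines such as "[c2]            Rate: 40"
RATE_LINE = re.compile(r"^\[(?P<host>[^\]]+)\]\s*Rate:\s*(?P<rate>\d+)")


def nodes_with_unexpected_rate(output: str, expected: int = 40) -> dict:
    """Map each host whose rate differs from `expected` to the rate it reported."""
    mismatched = {}
    for line in output.splitlines():
        match = RATE_LINE.match(line.strip())
        if match and int(match.group("rate")) != expected:
            mismatched[match.group("host")] = int(match.group("rate"))
    return mismatched


# Hypothetical captured output; c5 is deliberately shown at a lower rate.
sample = """\
[c2]            Rate: 40
[c5]            Rate: 10
[c9]            Rate: 40
"""
print(nodes_with_unexpected_rate(sample))
```

Any host in the returned mapping is a candidate for a cable check.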

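3. Check that the OFED version is the same on every node

The OFED version should be the same across the cluster, to avoid compatibility problems between some applications and particular OFED versions. You can check in parallel with nprsh, as shown below: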

root@cluster ~ # nprsh  -on c2..12 "ofed_info |grep OFED- "

[c3]OFED-1.5.2:

[c7]OFED-1.5.2:

[c11]OFED-1.5.2:

[c2]OFED-1.5.2:

[c6]OFED-1.5.2:

[c12]OFED-1.5.2:

[c4]OFED-1.5.2:

[c8]OFED-1.5.2:

[c10]OFED-1.5.2:

[c9]OFED-1.5.2:

[c5]OFED-1.5.2:

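If the OFED versions differ, reinstall the driver on the affected nodes so that all nodes run the same OFED version.

4. Check that the HCA firmware version is the same across the cluster

You can check in parallel with nprsh, as shown below: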

root@cluster ~ # nprsh  -on c2..12  "ibstat |grep Firmware"

[c7]    Firmware version: 2.8.600

[c2]    Firmware version: 2.8.600

[c11]   Firmware version: 2.8.600

[c5]    Firmware version: 2.8.600

[c10]   Firmware version: 2.8.600

[c4]    Firmware version: 2.8.600

[c8]    Firmware version: 2.8.600

[c12]   Firmware version: 2.8.600

[c3]    Firmware version: 2.8.600

[c9]    Firmware version: 2.8.600

[c6]    Firmware version: 2.8.600

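During deployment you may find HCAs with differing firmware versions. Reflash the firmware on those HCAs so that the network does not become unstable later in production.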
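To see at a glance which HCAs need reflashing, the output can be grouped by version. The sketch below assumes the `[host] Firmware version: X` format shown above. The function names and the second version number in the sample are invented for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as "[c7]    Firmware version: 2.8.600"
FW_LINE = re.compile(r"^\[(?P<host>[^\]]+)\]\s*Firmware version:\s*(?P<version>\S+)")


def hosts_by_firmware(output: str) -> dict:
    """Group host names by the firmware version they report."""
    groups = defaultdict(list)
    for line in output.splitlines():
        match = FW_LINE.match(line.strip())
        if match:
            groups[match.group("version")].append(match.group("host"))
    return {version: sorted(hosts) for version, hosts in groups.items()}


def hosts_off_majority(groups: dict) -> list:
    """Return hosts that do not run the most common firmware version."""
    if not groups:
        return []
    majority = max(groups, key=lambda version: len(groups[version]))
    return sorted(h for v, hosts in groups.items() if v != majority for h in hosts)


# Hypothetical captured output; the 2.7.000 entry is made up.
sample = """\
[c7]    Firmware version: 2.8.600
[c2]    Firmware version: 2.8.600
[c11]   Firmware version: 2.7.000
"""
groups = hosts_by_firmware(sample)
print(groups)
print(hosts_off_majority(groups))
```

Treating the most common version as the target is a simplification. If the majority is itself outdated, compare against the version you intend to deploy instead.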

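5. View the topology of the whole IB network

Use ibnetdiscover to obtain the topology of the whole fabric and save it to a file named ibnet, so that error ports can later be mapped to their connections. For example: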

root@cluster ~ # ibnetdiscover > ibnet

Explanation of the ibnetdiscover output:

#

# Topology file: generated on Sun Jul  3 13:23:12 2011

#

# Initiated from node 10d2c91000000750 port 10d2c91000000751

vendid=0x2c9

devid=0xbd36

sysimgguid=0x2c9020042fbab

switchguid=0x2c9020042fba8(2c9020042fba8) (GUID of this switch)

Switch  36 "S-0002c9020042fba8"         # "Infiniscale-IV Mellanox Technologies" base port 0 lid 7 lmc 0

[1]     "H-10d2c91000000080"[1](10d2c91000000081)               # "compute-0-8 HCA-1" lid 18 4xQDR (port 1 of this switch connects to the node with hostname compute-0-8)

[2]     "H-10d2c91000000360"[1](10d2c91000000361)               # "compute-0-11 HCA-1" lid 21 4xQDR (port 2 of this switch connects to the node with hostname compute-0-11)

[3]     "H-10d2c91000000ac0"[1](10d2c91000000ac1)               # "compute-0-9 HCA-1" lid 19 4xQDR (port 3 of this switch connects to the node with hostname compute-0-9)

[4]     "H-10d2c910000000d0"[1](10d2c910000000d1)               # "compute-0-13 HCA-1" lid 23 4xQDR (and so on for the remaining ports)

[5]     "H-10d2c91000000c90"[1](10d2c91000000c91)               # "compute-0-10 HCA-1" lid 20 4xQDR

[6]     "H-10d2c910000004c0"[1](10d2c910000004c1)               # "compute-0-12 HCA-1" lid 57 4xQDR

[19]    "S-0002c9020042fb78"[13]                # "Infiniscale-IV Mellanox Technologies" lid 4 4xQDR (port 19 of this switch connects to port 13 of the switch with GUID 0002c9020042fb78)

[20]    "S-0002c9020042fb78"[17]                # "Infiniscale-IV Mellanox Technologies" lid 4 4xQDR (port 20 of this switch connects to port 17 of the switch with GUID 0002c9020042fb78)

[21]    "S-0002c9020042fb78"[14]                # "Infiniscale-IV Mellanox Technologies" lid 4 4xQDR (and so on for the remaining ports)

[22]    "S-0002c9020042fb78"[15]                # "Infiniscale-IV Mellanox Technologies" lid 4 4xQDR

[23]    "S-0002c9020042fb78"[18]                # "Infiniscale-IV Mellanox Technologies" lid 4 4xQDR

[24]    "S-0002c9020042fb78"[16]                # "Infiniscale-IV Mellanox Technologies" lid 4 4xQDR

[25]    "S-0002c9020042fb58"[22]                # "Infiniscale-IV Mellanox Technologies" lid 2 4xQDR

[26]    "S-0002c9020042fb58"[23]                # "Infiniscale-IV Mellanox Technologies" lid 2 4xQDR

[29]    "S-0002c9020042fb58"[19]                # "Infiniscale-IV Mellanox Technologies" lid 2 4xQDR

[30]    "S-0002c9020042fb58"[20]                # "Infiniscale-IV Mellanox Technologies" lid 2 4xQDR

[31]    "S-0002c9020042fb98"[31]                # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR

[32]    "S-0002c9020042fb98"[35]                # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR

[33]    "S-0002c9020042fb98"[32]                # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR

[34]    "S-0002c9020042fb98"[36]                # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR

[35]    "S-0002c9020042fb98"[33]                # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR

[36]    "S-0002c9020042fb98"[34]                # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR

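6. Check IB network ports for errors

When auditing the IB network, put it under load first. Typically you run an IB-based test program such as all-to-all or linpack.

Use the ibdiagnet tool to view port errors and warnings on the IB network:

The port counters that ibdiagnet reports accumulate over time, so clear the counters for the whole fabric before each check. The command is: ibdiagnet -pc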

root@cluster ~ # ibdiagnet  -pc

Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.5.7

-W- Topology file is not specified.

    Reports regarding cluster links will use direct routes.

Loading IBDM from: /usr/lib64/ibdm1.5.7

-I- Using port 1 as the local port.

-I- Discovering ... 63 nodes (6 Switches & 57 CA-s) discovered. (number of nodes and switches detected)

-I---------------------------------------------------

-I- Bad Guids/LIDs Info

-I---------------------------------------------------

-I- No bad Guids were found

-I---------------------------------------------------

-I- Links With Logical State = INIT

-I---------------------------------------------------

-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------

-I- General Device Info

-I---------------------------------------------------

-I---------------------------------------------------

-I- PM Counters Info

-I---------------------------------------------------

-I- No illegal PM counters values were found

-I---------------------------------------------------

-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)

-I---------------------------------------------------

-I-    PKey:0x7fff Hosts:57 full:57 limited:0

-I---------------------------------------------------

-I- IPoIB Subnets Check

-I---------------------------------------------------

-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00

-I---------------------------------------------------

-I- Bad Links Info

-I- No bad link were found

-I---------------------------------------------------

----------------------------------------------------------------

-I- Stages Status Report: (errors and warnings for the whole fabric; the counters have been cleared)

    STAGE                              Errors Warnings

    Bad GUIDs/LIDs Check                  0      0

    Link State Active Check                 0      0

    General Devices Info Report             0      0

    Performance Counters Report            0      0

    Partitions Check                       0      0

    IPoIB Subnets Check                    0      0

Please see /tmp/ibdiagnet.log for complete log

----------------------------------------------------------------

-I- Done. Run time was 6 seconds.

Use the "ibdiagnet -P all=1" command to view all port errors and warnings on the IB network, as shown below:

Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.5.7

-W- Topology file is not specified.

    Reports regarding cluster links will use direct routes.

Loading IBDM from: /usr/lib64/ibdm1.5.7

-I- Using port 1 as the local port.

-I- Discovering ... 0 nodes (0 Switches & 0 CA-s) discovered.

(……)

-I---------------------------------------------------

-I- Bad Guids/LIDs Info

-I---------------------------------------------------

-I- No bad Guids were found

-I---------------------------------------------------

-I- Links With Logical State = INIT

-I---------------------------------------------------

-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------

-I- General Device Info

-I---------------------------------------------------

-I---------------------------------------------------

-I- PM Counters Info

-I---------------------------------------------------

-W- lid=0x0006 guid=0x0002c9020042fb98 dev=48438 Port=30

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x15e (packets received on this port that contained an error)

      symbol_error_counter            : 0x160  (minor link errors detected on one or more physical lanes)

-W- lid=0x0008 guid=0x10d2c91000000661 dev=26428 nas-0-1/P1

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x3

      symbol_error_counter            : 0x34b (Increase by 1 during ibdiagnet

      scan.)

-W- lid=0x0005 guid=0x0002c9020042b490 dev=48438 Port=23

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0xd1

      symbol_error_counter            : 0xd2

-W- lid=0x0005 guid=0x0002c9020042b490 dev=48438 Port=9

      Performance Monitor counter     : Value

      port_xmit_discard               : 0x4

-W- lid=0x0002 guid=0x0002c9020042fb58 dev=48438 Port=24

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x4e

      symbol_error_counter            : 0x78

-W- lid=0x0002 guid=0x0002c9020042fb58 dev=48438 Port=30

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x4

      symbol_error_counter            : 0x4

-W- lid=0x0007 guid=0x0002c9020042fba8 dev=48438 Port=23

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x13

      symbol_error_counter            : 0x17

-I---------------------------------------------------

-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)

-I---------------------------------------------------

-I-    PKey:0x7fff Hosts:56 full:56 limited:0

-I---------------------------------------------------

-I- IPoIB Subnets Check

-I---------------------------------------------------

-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00

-I---------------------------------------------------

-I- Bad Links Info

-I- No bad link were found

-I---------------------------------------------------

----------------------------------------------------------------

-I- Stages Status Report:

    STAGE                                    Errors Warnings

    Bad GUIDs/LIDs Check                     0      0    

    Link State Active Check                  0      0    

    General Devices Info Report              0      0    

    Performance Counters Report              0      7    

    Partitions Check                         0      0    

    IPoIB Subnets Check                      0      0    

Please see /tmp/ibdiagnet.log for complete log

----------------------------------------------------------------

The output of "ibdiagnet -P all=1" shows that 7 ports report warnings, for example:

-W- lid=0x0006 guid=0x0002c9020042fb98 dev=48438 Port=30

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x15e (packets received on this port that contained an error)

      symbol_error_counter            : 0x160  (minor link errors detected on one or more physical lanes)

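This means that port 30 of the switch with GUID 0x0002c9020042fb98 reports 350 port_rcv_errors (0x15e) and 352 symbol_error_counter (0x160) events. (Note: on a QDR network, up to 3 symbol_error_counter or port_rcv_errors events per minute are acceptable; a higher rate indicates a problem with the IB port or cable.)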
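Because ibdiagnet prints the counters in hexadecimal, it can help to convert them while extracting the warning blocks. The following is a minimal sketch, assuming the "PM Counters Info" section has been saved as text in the format shown above (without the blank lines between rows). `parse_pm_warnings` is a hypothetical helper, not part of ibdiagnet.

```python
import re

# Header of a warning block, e.g.
# "-W- lid=0x0006 guid=0x0002c9020042fb98 dev=48438 Port=30"
HEADER = re.compile(
    r"^-W- lid=(?P<lid>0x[0-9a-fA-F]+) guid=(?P<guid>0x[0-9a-fA-F]+) "
    r"dev=\d+ (?P<port>.+)$"
)
# Counter row, e.g. "port_rcv_errors                 : 0x15e"
COUNTER = re.compile(r"^(?P<name>[a-z_]+)\s*:\s*(?P<value>0x[0-9a-fA-F]+)")


def parse_pm_warnings(report: str) -> list:
    """Return one dict per -W- block, with counter values converted to decimal."""
    warnings = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {
                "guid": header.group("guid"),
                "port": header.group("port"),
                "counters": {},
            }
            warnings.append(current)
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            current["counters"][counter.group("name")] = int(counter.group("value"), 16)
    return warnings


sample = """\
-W- lid=0x0006 guid=0x0002c9020042fb98 dev=48438 Port=30
      Performance Monitor counter     : Value
      port_rcv_errors                 : 0x15e
      symbol_error_counter            : 0x160
-W- lid=0x0005 guid=0x0002c9020042b490 dev=48438 Port=9
      Performance Monitor counter     : Value
      port_xmit_discard               : 0x4
"""
for warning in parse_pm_warnings(sample):
    print(warning["guid"], warning["port"], warning["counters"])
```

For the first block this yields 350 and 352, matching the hand conversion of 0x15e and 0x160.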

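The procedure for detecting IB port errors can be summarized as follows: run an IB-based stress test; clear the IB error counters with "ibdiagnet -pc"; wait 5 minutes; run "ibdiagnet -P all=1"; then record every port with more than 5 × 3 = 15 errors or warnings as a problem port.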
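The threshold rule can be written down directly. This sketch assumes the counters have already been converted to decimal and were collected over a known number of minutes since the last reset. The names `ALLOWED_PER_MINUTE`, `WATCHED`, and `is_problem_port` are illustrative only, and the example treats the sample values from this article as if they had been gathered over five minutes.

```python
# Allowance for QDR links quoted in the text: 3 events per minute.
ALLOWED_PER_MINUTE = 3
# Counters the rule applies to.
WATCHED = ("symbol_error_counter", "port_rcv_errors")


def is_problem_port(counters: dict, minutes: int) -> bool:
    """Return True if any watched counter exceeds the allowance for the interval."""
    limit = ALLOWED_PER_MINUTE * minutes  # e.g. 3 * 5 = 15
    return any(counters.get(name, 0) > limit for name in WATCHED)


# Values taken from the sample ibdiagnet output, assumed here to cover 5 minutes.
noisy = {"port_rcv_errors": 0x15E, "symbol_error_counter": 0x160}
quiet = {"port_rcv_errors": 0x4, "symbol_error_counter": 0x4}

print(is_problem_port(noisy, minutes=5))
print(is_problem_port(quiet, minutes=5))
```

Counters outside `WATCHED`, such as port_xmit_discard, are ignored by this rule because the text states an allowance only for the two listed counters.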

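7. Locate the physical position of the problem port or cable

Take the GUID of each problem port recorded by ibdiagnet -P all=1 in the previous step and look it up in the ibnet topology file produced by ibnetdiscover to find the switch port or link at fault. For example, if the problem port is: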

-W- lid=0x0006 guid=0x0002c9020042fb98 dev=48438 Port=30

      Performance Monitor counter     : Value

      port_rcv_errors                 : 0x15e

      symbol_error_counter            : 0x160  

then find port 30 of the switch with guid=0x0002c9020042fb98 in ibnet. Other problem ports are located the same way. For example:

vendid=0x2c9

devid=0xbd36

sysimgguid=0x2c9020042fb9b

switchguid=0x2c9020042fb98(2c9020042fb98)

Switch  36 "S-0002c9020042fb98"         # "Infiniscale-IV Mellanox Technologies" base port 0 lid 6 lmc 0

[1]     "H-10d2c91000000e80"[1](10d2c91000000e81)               # "compute-0-48 HCA-1" lid 53 4xQDR

[5]     "H-10d2c910000005b0"[1](10d2c910000005b1)               # "compute-0-51 HCA-1" lid 1 4xQDR

[13]    "H-10d2c91000000d50"[1](10d2c91000000d51)               # "compute-0-36 HCA-1" lid 44 4xQDR

[14]    "H-10d2c91000000ab0"[1](10d2c91000000ab1)               # "compute-0-33 HCA-1" lid 41 4xQDR

[15]    "H-10d2c91000000c20"[1](10d2c91000000c21)               # "compute-0-35 HCA-1" lid 43 4xQDR

[16]    "H-10d2c91000000370"[1](10d2c91000000371)               # "compute-0-32 HCA-1" lid 40 4xQDR

[17]    "H-10d2c91000000af0"[1](10d2c91000000af1)               # "compute-0-34 HCA-1" lid 42 4xQDR

[18]    "H-10d2c91000001220"[1](10d2c91000001221)               # "compute-0-31 HCA-1" lid 39 4xQDR

[19]    "S-0002c9020042b490"[20]                # "Infiniscale-IV Mellanox Technologies" lid 5 4xQDR

[20]    "S-0002c9020042b490"[21]                # "Infiniscale-IV Mellanox Technologies" lid 5 4xQDR

[21]    "S-0002c9020042b490"[22]                # "Infiniscale-IV Mellanox Technologies" lid 5 4xQDR

[22]    "S-0002c9020042b490"[19]                # "Infiniscale-IV Mellanox Technologies" lid 5 4xQDR

[24]    "S-0002c9020042b490"[24]                # "Infiniscale-IV Mellanox Technologies" lid 5 4xQDR

[25]    "S-0002c9020042fb68"[29]                # "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

[26]    "S-0002c9020042fb68"[30]                # "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

[27]    "S-0002c9020042fb68"[27]                # "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

[28]    "S-0002c9020042fb68"[25]                # "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

[29]    "S-0002c9020042fb68"[26]                # "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

[30]    "S-0002c9020042fba8"[31]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR

[32]    "S-0002c9020042fba8"[33]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR

[33]    "S-0002c9020042fba8"[35]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR

[34]    "S-0002c9020042fba8"[36]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR

[35]    "S-0002c9020042fba8"[32]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR

[36]    "S-0002c9020042fba8"[34]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR
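The lookup in the ibnet file can also be done programmatically. The following is a sketch under the assumption that the file uses the ibnetdiscover format shown above, with each switch section starting at a `Switch ... "S-<guid>"` line. The function `find_peer` is hypothetical, and the sample holds only three lines copied from the listing.

```python
import re

# Section header, e.g. 'Switch  36 "S-0002c9020042fb98"   # ...'
SWITCH = re.compile(r'^Switch\s+\d+\s+"S-(?P<guid>[0-9a-f]+)"')
# Port row, e.g. '[30]    "S-0002c9020042fba8"[31]    # "Infiniscale-IV ..." lid 7 4xQDR'
PORT = re.compile(
    r'^\[(?P<port>\d+)\]\s+"(?P<peer>[SH]-[0-9a-f]+)"\[(?P<peer_port>\d+)\]'
    r'(?:\([0-9a-f]+\))?\s+#\s+"(?P<desc>[^"]*)"'
)


def find_peer(topology: str, switch_guid: str, port: int):
    """Return (peer_id, peer_port, description) for a switch port, or None."""
    wanted = int(switch_guid, 16)  # accepts GUIDs with or without a 0x prefix
    in_wanted_switch = False
    for raw in topology.splitlines():
        line = raw.strip()
        if line.startswith(("vendid=", "Ca ")):
            in_wanted_switch = False  # a new section begins
            continue
        switch = SWITCH.match(line)
        if switch:
            in_wanted_switch = int(switch.group("guid"), 16) == wanted
            continue
        if in_wanted_switch:
            entry = PORT.match(line)
            if entry and int(entry.group("port")) == port:
                return (
                    entry.group("peer"),
                    int(entry.group("peer_port")),
                    entry.group("desc"),
                )
    return None


sample = """\
Switch  36 "S-0002c9020042fb98"         # "Infiniscale-IV Mellanox Technologies" base port 0 lid 6 lmc 0
[1]     "H-10d2c91000000e80"[1](10d2c91000000e81)               # "compute-0-48 HCA-1" lid 53 4xQDR
[30]    "S-0002c9020042fba8"[31]                # "Infiniscale-IV Mellanox Technologies" lid 7 4xQDR
"""
print(find_peer(sample, "0x0002c9020042fb98", 30))
```

For port 30 this returns the switch S-0002c9020042fba8 and its port 31, which identifies both ends of the cable to inspect.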

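Causes of IB port errors

Serious port errors are usually caused by one of the following:

1) The HCA is not seated firmly or is faulty;

2) On a blade node, the IB module is not seated properly, one of its ports is faulty, or the blade's daughter card is faulty;

3) The cable attached to the port is faulty;

4) A cooling problem: the IB module is running too hot.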
