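Background
Our production servers use Intel XL710 NICs. After going into service, a NIC would suddenly lose link: the link state
went down and did not recover, and only resetting the server brought it back. After a while the same problem would
recur, and similar failures showed up at several sites.
When the problem first appeared, past experience pointed to the usual suspects: optical module or fiber compatibility,
or a bad NIC hardware batch. But as more machines were affected and the failures became more frequent, it became clear
that the problem was not that simple.
First, let's look at the information collected from the production environment:
NIC status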
# ip a | grep eth4
5: eth4: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond5 state DOWN qlen 1000
GZ-SN-OTT18:~ #
GZ-SN-OTT18:~ # ethtool eth4
Settings for eth4:
        Supported ports: [ ]
        Supported link modes:   1000baseT/Full
                                10000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
                                10000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: Other
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x0000000f (15)
                               drv probe link timer
        Link detected: no
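When the same check has to be repeated across many machines, the "Link detected" field can be pulled out of the ethtool output with a small helper. The following is only a minimal sketch: the function name link_detected is hypothetical (not part of ethtool), and it assumes the output contains a "Link detected:" line in the format shown above.

```shell
#!/bin/sh
# Sketch: read `ethtool <iface>`-style text on stdin and print the value of
# the "Link detected" field ("yes" or "no"), or "unknown" if it is missing.
# link_detected is a hypothetical helper name used for illustration only.
link_detected() {
    awk -F': *' '
        /Link detected:/ { print $2; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}
```

A possible use would be `ethtool eth4 | link_detected`, which for the output above would print `no`.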
Kernel log
Jul 20 11:28:38 JAedge5 kernel: [8395582.875067] i40e 0000:0a:00.2: TX driver issue detected, PF reset issued ## NIC error
Jul 20 11:28:38 JAedge5 kernel: klogd 1.4.1, ---------- state change ----------
Jul 20 11:28:38 JAedge5 kernel: [8395583.453724] kworker/u:4: page allocation failure: order:5, mode:0x80d0 ## failed to allocate order=5 contiguous pages (2^5 = 32 pages)
Jul 20 11:28:38 JAedge5 kernel: [8395583.453734] Pid: 21027, comm: kworker/u:4 Tainted: G ENX 3.0.101-0.47.52-default #1
Jul 20 11:28:38 JAedge5 kernel: [8395583.453738] Call Trace:
Jul 20 11:28:38 JAedge5 kernel: [8395583.453762] dump_trace+0x75/0x300
Jul 20 11:28:38 JAedge5 kernel: [8395583.453777] dump_stack+0x69/0x6f
Jul 20 11:28:38 JAedge5 kernel: [8395583.453790] warn_alloc_failed+0xc6/0x170
Jul 20 11:28:38 JAedge