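I ran into a problem: an x86 board is connected over PCIe to an external 82599 NIC chip, and on one particular board the network stops working after a while. Investigation showed that at some point the system logs the following exception: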
[ 1250.888189] Uhhuh. NMI received for unknown reason 31 on CPU 0.
[ 1250.894247] Do you have a strange power saving mode enabled?
[ 1250.899962] Dazed and confused, but trying to continue
[ 1250.948622] ixgbe 0000:05:00.0: Adapter removed
After that, the network is down. The relevant information is as follows:
[root@2aa477aa-cc3a-11e8-8c50-6c92bf992afa ~]# ethtool eth5
Settings for eth5:
Supported ports: [ FIBRE ]
Supported link modes: 10000baseKX4/Full
10000baseKR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: No
Advertised link modes: 10000baseKX4/Full
10000baseKR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: No
Speed: Unknown!
Duplex: Unknown! (255)
Port: Other
PHYAD: 0
Transceiver: external
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Current message level:0x00000007 (7)
drv probe link
Link detected: no
[root@2aa477aa-cc3a-11e8-8c50-6c92bf992afa ~]# lspci -s 5:0.0 -vxxx
05:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Network Connection (rev 01)
Physical Slot: 3
Flags: fast devsel, IRQ 30
Memory at fbc20000 (64-bit, non-prefetchable) [disabled] [size=128K]
I/O ports at d020 [disabled] [size=32]
Memory at fbc44000 (64-bit, non-prefetchable) [disabled] [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable- Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [e0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 00-14-10-ff-ff-1e-e8-8d
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: ixgbe
00: 86 80 fc 10 00 00 10 00 01 00 00 02 10 00 80 00
10: 04 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
20: 04 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
40: 01 50 23 48 00 20 00 00 00 00 00 00 00 00 00 00
50: 05 70 80 01 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 11 a0 3f 00 04 00 00 00 04 20 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 10 e0 02 00 c2 8c 00 10 10 28 09 00 82 d4 02 00
b0: 00 00 82 10 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
d0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
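The `[disabled]` markers on the BARs can be cross-checked against the raw dump. The following is a minimal sketch, assuming the standard PCI configuration header layout (Command register at offset 0x04, little-endian, with bit 0 = I/O space, bit 1 = memory space, bit 2 = bus master). `decode_command` is a hypothetical helper name, not part of any tool.

```bash
#!/bin/bash
# Decode the PCI Command register from the "00:" row of an `lspci -xxx` dump.
decode_command() {
    local -a b
    read -r -a b <<< "$1"
    # b[0] is the "00:" label, so config offset N is b[N+1].
    # The register is little-endian: offset 5 is the high byte, offset 4 the low byte.
    local cmd=$(( 0x${b[6]}${b[5]} ))
    printf 'command=0x%04x io=%d mem=%d busmaster=%d\n' \
        "$cmd" $(( cmd & 1 )) $(( (cmd >> 1) & 1 )) $(( (cmd >> 2) & 1 ))
}

# First row of the dump captured above.
decode_command "00: 86 80 fc 10 00 00 10 00 01 00 00 02 10 00 80 00"
```

For the row above, all three enable bits come out as 0, which matches the `[disabled]` BARs and the `Enable-` MSI/MSI-X flags in the decoded `lspci` output.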
[root@2aa477aa-cc3a-11e8-8c50-6c92bf992afa ~]# ifconfig eth5
eth5: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 10.0.0.3 netmask 255.255.255.0 broadcast 10.0.0.255
ether 00:14:10:1e:e8:8d txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 8589934590 dropped 34359738360 overruns 0 frame 8589934590
TX packets 1 bytes 42 (42.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
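The RX error counters above look less arbitrary once divided by 0xFFFFFFFF (4294967295), the all-ones value that a read from a PCIe device that no longer responds typically returns. The sketch below only does that arithmetic. That the driver accumulated all-ones register reads into these statistics is an assumption; the logs do not confirm it. `check_counter` is a hypothetical helper name.

```bash
#!/bin/bash
# Check whether a counter value is an exact multiple of 0xFFFFFFFF.
ALL_ONES=$((0xFFFFFFFF))   # 4294967295

check_counter() {
    local name=$1 value=$2
    if (( value % ALL_ONES == 0 )); then
        echo "$name = $((value / ALL_ONES)) x 0xFFFFFFFF"
    else
        echo "$name is not a multiple of 0xFFFFFFFF"
    fi
}

# Values taken from the ifconfig output above.
check_counter rx_errors 8589934590
check_counter dropped   34359738360
check_counter frame     8589934590
```

All three values are exact multiples (2, 8, and 2 times 0xFFFFFFFF), which is consistent with the "Adapter removed" message rather than with real traffic errors.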
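I wrote a script for a burn-in test. It waits 1500 seconds, checks the link speed, and reboots if the link is still at 10000Mb/s; otherwise it leaves the system running: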
#!/bin/bash
# Burn-in check: reboot while the link is healthy, stop once it fails.
LOG=/opt/devicedriver/log/net.log

sleep 1500
date >>"$LOG" 2>&1
# Run the pipeline in the foreground so that $? is grep's real exit status
# (with a trailing "&", $? is always 0 and the script would always reboot).
ethtool eth5 | grep "Speed: 10000Mb/s" >>"$LOG" 2>&1
res=$?
echo "$res"
if [ "$res" -eq 0 ]; then
    echo "reboot" >>"$LOG" 2>&1
    reboot
else
    echo "don't reboot" >>"$LOG" 2>&1
fi