Linux: Investigating Why a Production Server Rebooted Unexpectedly

Background: one of our company's HP servers recently rebooted on its own without warning, so I tried to track down the cause.

Troubleshooting steps:

1. Log in to the machine and run commands such as last or uptime to find when it rebooted

$ last | grep reboot
reboot   system boot  3.10.0-1160.24.1 Mon Oct 11 19:19 - 10:49  (15:30)
reboot   system boot  3.10.0-1160.24.1 Wed Oct  6 14:08 - 10:49 (5+20:41)
reboot   system boot  3.10.0-1160.24.1 Mon Oct  4 13:03 - 10:49 (7+21:46)
reboot   system boot  3.10.0-1160.24.1 Sun Oct  3 21:39 - 10:49 (8+13:10)
reboot   system boot  3.10.0-1160.24.1 Sun Oct  3 09:12 - 10:49 (9+01:37)
reboot   system boot  3.10.0-1160.24.1 Sat Sep 25 23:13 - 10:49 (16+11:36)

$ uptime
 10:53:27 up 15:34,  1 user,  load average: 2.43, 1.74, 1.43
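The reboot list above shows when the machine came back, but not whether it went down cleanly. One way to check, sketched here under assumptions, is to feed the output of `last -x` (which also lists shutdown records) through a small filter: a reboot entry whose next-older record is not a shutdown suggests an unclean reset. The function name `classify_boots` is my own and not part of any tool.

```shell
# Sketch: classify each reboot in `last -x` output (newest entry first).
# Assumption: a clean shutdown leaves a "shutdown" pseudo-user record
# immediately before (i.e. on the line after) the matching "reboot" record.
classify_boots() {
    awk '
        # Keep only reboot and shutdown records, in the order they appear.
        $1 == "reboot" || $1 == "shutdown" { kind[++n] = $1; line[n] = $0 }
        END {
            for (i = 1; i <= n; i++) {
                if (kind[i] != "reboot") continue
                if (i == n)
                    verdict = "unknown"   # oldest record, nothing to compare
                else if (kind[i + 1] == "shutdown")
                    verdict = "clean"
                else
                    verdict = "unclean"
                print verdict ": " line[i]
            }
        }
    '
}
```

Usage would be `last -x | classify_boots`. The result is only a hint, since wtmp rotation or a missing shutdown record can make a clean reboot look unclean.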

2. Check the system logs (dmesg, /var/log/messages, kdump, and so on)

dmesg: kernel ring buffer (kernel messages since the current boot)

$ dmesg | grep -Ei 'error|Fail'
[    0.000000] tsc: Fast TSC calibration failed
[    3.120763] pci 0000:12:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
[    3.178571] pci 0000:5c:00.0: BAR 6: failed to assign [mem size 0x00200000 pref]
[    3.223819] pci 0000:5d:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
[    3.240238] pci 0000:5d:00.2: BAR 6: failed to assign [mem size 0x00080000 pref]
[    3.256824] pci 0000:5d:00.3: BAR 6: failed to assign [mem size 0x00080000 pref]
[    3.366034] pci 0000:00:14.0: xHCI BIOS handoff failed (BIOS bug ?) 00012201
[    4.635351] ioapic: probe of 0000:00:05.4 failed with error -22
[    4.642051] ioapic: probe of 0000:11:05.4 failed with error -22
[    4.648757] ioapic: probe of 0000:36:05.4 failed with error -22
[    4.655459] ioapic: probe of 0000:5b:05.4 failed with error -22
[    4.662176] ioapic: probe of 0000:80:05.4 failed with error -22
[    4.668874] ioapic: probe of 0000:85:05.4 failed with error -22
[    4.675576] ioapic: probe of 0000:ae:05.4 failed with error -22
[    4.682278] ioapic: probe of 0000:d7:05.4 failed with error -22
[    4.716010] ERST: Error Record Serialization Table (ERST) support is initialized.
[    6.058884] smartpqi: module verification failed: signature and/or required key missing - tainting kernel
[24726.679793] tsar[94262]: segfault at fffffffffffffff0 ip 00007fd5cddf5dd6 sp 00007fff9aa2c608 error 5 in libc-2.17.so[7fd5cdca0000+1c3000]
[24737.612788] tsar[95267]: segfault at fffffffffffffff0 ip 00007f9205c50dd6 sp 00007ffe55047368 error 5 in libc-2.17.so[7f9205afb000+1c3000]
[24740.345420] tsar[95426]: segfault at fffffffffffffff0 ip 00007f99f20efdd6 sp 00007ffe70032fa8 error 5 in libc-2.17.so[7f99f1f9a000+1c3000]
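The tsar segfault lines in this output appear about 24,700 seconds (roughly seven hours) after boot, so they come after the reboot rather than before it. To check whether the three crashes hit the same code, the instruction pointer can be converted into an offset inside libc by subtracting the mapping base shown in brackets. A minimal sketch follows; the helper name `fault_offset` is hypothetical.

```shell
# Sketch: compute "ip - mapping base" for one kernel segfault line.
# Expects the usual format: "... ip <hex> sp <hex> error N in lib[<base>+<size>]".
fault_offset() {
    # Instruction pointer at the time of the fault.
    ip=$(printf '%s\n' "$1" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    # Load address of the mapped object (the part before "+" in brackets).
    base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*]$/\1/p')
    [ -n "$ip" ] && [ -n "$base" ] || return 1
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}
```

Running `dmesg | grep segfault | while read -r l; do fault_offset "$l"; done` on the lines above should give the same offset, 0x155dd6, for all three. That points to one repeatable user-space bug in tsar and not to random memory corruption.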

/var/log/messages: system log

$ grep -Ei 'error|Fail' /var/log/messages
Oct 11 19:19:35 kuyun.a01.host kernel: tsc: Fast TSC calibration failed
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:12:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5c:00.0: BAR 6: failed to assign [mem size 0x00200000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5d:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5d:00.2: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5d:00.3: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:00:14.0: xHCI BIOS handoff failed (BIOS bug ?) 00012201
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:00:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:11:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:36:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:5b:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:80:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:85:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:ae:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:d7:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ERST: Error Record Serialization Table (ERST) support is initialized.
Oct 11 19:19:35 kuyun.a01.host kernel: smartpqi: module verification failed: signature and/or required key missing - tainting kernel
Oct 11 19:19:35 kuyun.a01.host systemd[1]: Failed to start Configure CPU turboboost.
Oct 11 19:19:35 kuyun.a01.host systemd[1]: Unit cpunoturbo.service entered failed state.
Oct 11 19:19:35 kuyun.a01.host systemd[1]: cpunoturbo.service failed.
Oct 11 19:19:35 kuyun.a01.host syslog-ng[1144]: [2021-10-11T19:19:35.010289] Error resolving hostname; host='syslog.tbsite.net'
Oct 11 19:19:35 kuyun.a01.host syslog-ng[1144]: [2021-10-11T19:19:35.010373] Initiating connection failed, reconnecting; time_reopen='10'
Oct 11 19:19:39 kuyun.a01.host systemd[1562]: Failed at step EXEC spawning /home/staragent/bin/agent.sh: No such file or directory
Oct 11 19:19:39 kuyun.a01.host systemd[1]: Failed to start StarAgent2.0.
Oct 11 19:19:39 kuyun.a01.host systemd[1]: Unit staragentctl.service entered failed state.
Oct 11 19:19:39 kuyun.a01.host systemd[1]: staragentctl.service failed.
Oct 11 19:21:22 kuyun.a01.host useradd[9397]: failed adding user 'terminal', exit code: 9
Oct 12 02:11:08 kuyun.a01.host kernel: tsar[94262]: segfault at fffffffffffffff0 ip 00007fd5cddf5dd6 sp 00007fff9aa2c608 error 5 in libc-2.17.so[7fd5cdca0000+1c3000]
Oct 12 02:11:19 kuyun.a01.host kernel: tsar[95267]: segfault at fffffffffffffff0 ip 00007f9205c50dd6 sp 00007ffe55047368 error 5 in libc-2.17.so[7f9205afb000+1c3000]
Oct 12 02:11:22 kuyun.a01.host kernel: tsar[95426]: segfault at fffffffffffffff0 ip 00007f99f20efdd6 sp 00007ffe70032fa8 error 5 in libc-2.17.so[7f99f1f9a000+1c3000]
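Every line matched above is timestamped at or after the 19:19 boot, so the grep says little about what happened just before the machine went down. A complementary approach, sketched below, is to print the last few lines logged before each boot marker. I am assuming that each boot writes a "kernel: Linux version" line to /var/log/messages, which is typical but depends on the syslog configuration. The function name `lines_before_boot` is made up for this example.

```shell
# Sketch: show the N log lines that precede each boot in a syslog file.
# $1: number of context lines to keep (default 20); log text on stdin.
lines_before_boot() {
    awk -v n="${1:-20}" '
        / kernel: Linux version / {
            # Boot marker found: dump the buffered lines, oldest first.
            print "=== boot detected: " $1, $2, $3 " ==="
            for (i = count - n; i < count; i++)
                if (i >= 0) print buf[i % n]
            count = 0          # start a fresh buffer for the next boot
            next
        }
        # Ring buffer holding the most recent n lines.
        { buf[count % n] = $0; count++ }
    '
}
```

For example, `lines_before_boot 30 < /var/log/messages`. If the last lines before the marker are routine entries with no shutdown sequence, that is consistent with an abrupt reset, though it does not prove one.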

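kdump: crash dump log

The kdump service writes its crash logs under /var/crash/, but no log had been generated there for this incident. Had a dump been written, it could have been searched like this:

$ grep -Ei 'fail|error' /var/crash/<crash-date-directory>/vmcore-dmesg.txt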
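An empty /var/crash/ can mean several things: kdump was not set up, no memory was reserved for the crash kernel, or the machine was reset at the hardware level before the kernel could panic and write a dump. The sketch below covers the first two checks. Both helper names are hypothetical, and the default path assumes the stock kdump configuration.

```shell
# Sketch: two quick checks for why no vmcore was written.

# Return 0 if the given kernel command line reserves crash-kernel memory.
# $1: kernel command line text, e.g. the contents of /proc/cmdline
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *)                 return 1 ;;
    esac
}

# List any vmcore-dmesg.txt files below a crash directory.
# $1: directory to search (default /var/crash, an assumed default path)
list_vmcore_logs() {
    find "${1:-/var/crash}" -name vmcore-dmesg.txt 2>/dev/null
}
```

Typical usage would be `has_crashkernel "$(cat /proc/cmdline)" && echo reserved`, together with `systemctl is-active kdump` and `list_vmcore_logs`. If kdump is active and memory is reserved but nothing was dumped, a hardware-initiated reset becomes more likely.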

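In the system log, one kernel line stands out among the matches: ERST: Error Record Serialization Table (ERST) support is initialized. Note that this line only reports that ERST support came up at boot. It matches the grep because the table's name contains the word "Error", and on its own it does not indicate a fault.

For background on the ERST message, see: https://access.redhat.com/solutions/527433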

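3. Log in to the server's out-of-band management console and check its logs

This HP server has an out-of-band management web interface, so I logged in there directly. It shows detailed hardware error records, which is very convenient.

On the Integrated Management Log page of the out-of-band console there was indeed a CPU hardware error, shown below: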

Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000038, Bank 0x00000003, Status 0xBE000000'00800400, Address 0xFFFFFFFF'81637323, Misc 0xFFFFFFFF'81637323).

The recommended action is:

Update the system firmware. If the issue persists, contact support.

Learn more:

https://techlibrary.hpe.com/docs/enterprise/servers/gen10/ilo5/en/class0x0005code0x0003-gen10.html
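The Status value in the Machine Check Exception entry above can be partly decoded by hand. The sketch below assumes the standard x86 machine-check status register layout, where bits 63 down to 57 are VAL, OVER, UC, EN, MISCV, ADDRV and PCC, and the low 16 bits hold the MCA error code. It takes the value in the form iLO prints it, with an apostrophe between the upper and lower 32 bits. The function name `decode_mce_status` is my own.

```shell
# Sketch: decode the top flag bits of an MCi_STATUS value as printed by iLO.
# $1: status string such as 0xBE000000'00800400 (apostrophe splits the halves)
decode_mce_status() {
    hex=${1#0x}
    high=$(( 0x${hex%%"'"*} ))   # bits 63..32
    low=$(( 0x${hex##*"'"} ))    # bits 31..0
    flags=""
    bit=31                       # bit 63 of the register is bit 31 of the high half
    # VAL=valid, OVER=overflow, UC=uncorrected, EN=reporting enabled,
    # MISCV/ADDRV=misc/address register valid, PCC=processor context corrupt
    for name in VAL OVER UC EN MISCV ADDRV PCC; do
        if [ $(( ($high >> $bit) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
        bit=$(( bit - 1 ))
    done
    printf 'flags:%s\n' "$flags"
    printf 'mca_error_code: 0x%04x\n' $(( $low & 0xFFFF ))
}
```

For the logged value, `decode_mce_status "0xBE000000'00800400"` reports VAL, UC, EN, MISCV, ADDRV and PCC as set, with error code 0x0400. An uncorrected error with PCC set generally means the processor state could not be trusted, which fits a sudden reset with no kernel dump. What the error code means depends on the CPU model, so that part is best left to the vendor.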

Conclusion: this needs to go to the server vendor's after-sales support engineers, who can help diagnose and fix it.
