Troubleshooting hardware faults from logs on CentOS 7.9 [disk failures caused by overheating]
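An industrial PC (8 GB of memory, 128 GB disk) frequently failed to reboot normally. Going through the relevant logs traced the problem to excessive CPU temperature, which caused the CPU clock to be throttled and coincided with frequent disk link errors.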
1. Checking /var/log/dmesg
The /var/log/dmesg log can reveal disk fault information.
1.1 PCI device fails to get memory space assigned
[ 1.041807] pci 0000:00:1c.0: BAR 14: failed to assign [mem size 0x00200000]
[ 1.041816] pci 0000:00:1c.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 1.041821] pci 0000:00:1c.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 1.041827] pci 0000:00:1c.0: PCI bridge to [bus 01]
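As a rough illustration, the BAR assignment failures shown above can be pulled out of the boot log with a short script. This is only a sketch: it assumes Python 3 is installed (CentOS 7 ships Python 2 by default), and the function name pci_bar_failures is made up for this example.

```python
import re

# Matches kernel lines such as:
#   pci 0000:00:1c.0: BAR 14: failed to assign [mem size 0x00200000]
BAR_FAIL = re.compile(r"pci (\S+): BAR (\d+): failed to assign \[([^\]]*)\]")


def pci_bar_failures(dmesg_text):
    """Return {device: [(bar_number, resource), ...]} for failed BAR assignments."""
    failures = {}
    for line in dmesg_text.splitlines():
        match = BAR_FAIL.search(line)
        if match:
            device, bar, resource = match.groups()
            failures.setdefault(device, []).append((int(bar), resource))
    return failures


# Sample lines taken from the excerpt above.
PCI_SAMPLE = """\
[ 1.041807] pci 0000:00:1c.0: BAR 14: failed to assign [mem size 0x00200000]
[ 1.041816] pci 0000:00:1c.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 1.041821] pci 0000:00:1c.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 1.041827] pci 0000:00:1c.0: PCI bridge to [bus 01]
"""

print(pci_bar_failures(PCI_SAMPLE))
```

To run it against the real log, pass in the file contents instead, for example pci_bar_failures(open("/var/log/dmesg").read()).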
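1.2 Filesystem recovery messages (XFS replays its log at mount time, which means the previous shutdown was not clean)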
[ 4.721879] sda: sda1 sda2
[ 4.722851] sd 1:0:0:0: [sda] Attached SCSI disk
[ 5.242966] random: fast init done
[ 5.434299] SGI XFS with ACLs, security attributes, no debug enabled
[ 5.438591] XFS (dm-0): Mounting V5 Filesystem
[ 5.528502] XFS (dm-0): Starting recovery (logdev: internal)
[ 5.768210] XFS (dm-0): Ending recovery (logdev: internal)
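The recovery messages above can also be detected programmatically, which helps when checking many boots for unclean shutdowns. The sketch below assumes Python 3; the helper name xfs_recovery_durations is hypothetical. It pairs each "Starting recovery" line with the matching "Ending recovery" line and reports how long the log replay took.

```python
import re

# Matches lines such as:
#   [ 5.528502] XFS (dm-0): Starting recovery (logdev: internal)
XFS_EVENT = re.compile(
    r"\[\s*(\d+\.\d+)\] XFS \(([^)]+)\): (Starting|Ending) recovery"
)


def xfs_recovery_durations(dmesg_text):
    """Return {device: seconds spent in log recovery} for completed recoveries."""
    started = {}
    durations = {}
    for line in dmesg_text.splitlines():
        match = XFS_EVENT.search(line)
        if not match:
            continue
        stamp = float(match.group(1))
        device = match.group(2)
        phase = match.group(3)
        if phase == "Starting":
            started[device] = stamp
        elif device in started:
            # "Ending" line: compute elapsed time since the matching start.
            durations[device] = round(stamp - started.pop(device), 6)
    return durations


# Sample lines taken from the excerpt above.
XFS_SAMPLE = """\
[ 5.438591] XFS (dm-0): Mounting V5 Filesystem
[ 5.528502] XFS (dm-0): Starting recovery (logdev: internal)
[ 5.768210] XFS (dm-0): Ending recovery (logdev: internal)
"""

print(xfs_recovery_durations(XFS_SAMPLE))
```

An empty result means no recovery was logged, that is, the filesystem was unmounted cleanly before that boot.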
2. Checking /var/log/messages
2.1 Disk link is slow to respond
localhost kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
localhost kernel: ata2: link is slow to respond, please be patient (ready=0)
localhost kernel: ata2: COMRESET failed (error=-6)
localhost kernel: ata2: hard resetting link
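To see how often each SATA port hits these link problems, the messages can be counted per port and event type. This is a minimal sketch assuming Python 3; the name ata_link_events and the list of matched phrases are choices made for this example, based only on the messages shown above.

```python
import re
from collections import Counter

# Port name (ata2 or ata2.00), then one of the link-error phrases seen above.
ATA_EVENT = re.compile(
    r"\b(ata\d+)(?:\.\d+)?:\s*"
    r"(exception|link is slow to respond|COMRESET failed|hard resetting link)"
)


def ata_link_events(log_text):
    """Return a Counter keyed by (port, event) for ATA link-error lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_EVENT.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Sample lines modeled on the excerpt above.
ATA_SAMPLE = """\
localhost kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
localhost kernel: ata2: link is slow to respond, please be patient (ready=0)
localhost kernel: ata2: COMRESET failed (error=-6)
localhost kernel: ata2: hard resetting link
"""

for (port, event), n in sorted(ata_link_events(ATA_SAMPLE).items()):
    print(port, event, n)
```

A port whose counters keep growing between boots is a good candidate for checking the cable, the connector, and the drive itself.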
2.2 CPU temperature exceeds the threshold (above 85 °C) and the clock is throttled
Sep 3 00:50:59 localhost kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 586124)
Sep 3 00:50:59 localhost kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 586231)
Sep 3 00:50:59 localhost kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 584496)
Sep 3 00:50:59 localhost kernel: CPU3: Core temperature above threshold, cpu clock throttled (total events = 584627)
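The "total events" counter in these lines shows how many times each core has been throttled since boot. A short script can extract the highest value per core; the sketch below assumes Python 3, and throttle_totals is a name invented for this example.

```python
import re

# Matches the thermal throttling message and captures the core and its counter.
THROTTLE = re.compile(
    r"(CPU\d+): Core temperature above threshold, cpu clock throttled "
    r"\(total events = (\d+)\)"
)


def throttle_totals(log_text):
    """Return {cpu: highest 'total events' value seen in the log}."""
    totals = {}
    for match in THROTTLE.finditer(log_text):
        cpu = match.group(1)
        events = int(match.group(2))
        totals[cpu] = max(events, totals.get(cpu, 0))
    return totals


# Sample lines taken from the excerpt above.
THROTTLE_SAMPLE = """\
Sep 3 00:50:59 localhost kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 586124)
Sep 3 00:50:59 localhost kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 586231)
Sep 3 00:50:59 localhost kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 584496)
Sep 3 00:50:59 localhost kernel: CPU3: Core temperature above threshold, cpu clock throttled (total events = 584627)
"""

print(throttle_totals(THROTTLE_SAMPLE))
```

Counters in the hundreds of thousands, as in this log, indicate that the cores are throttling almost continuously rather than during occasional load peaks.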
3. Checking the temperature
Use the sensors command (provided by the lm_sensors package) to check the current CPU temperature.
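The temperature check can be automated by parsing the per-core lines that sensors prints. The following is a sketch under several assumptions: Python 3 is available, the output uses the usual "Core N: +NN.N°C" layout of the coretemp driver, and the sample values below are invented for illustration, not taken from the faulty machine. The function names are hypothetical.

```python
import re
import subprocess

# Per-core line, e.g. "Core 0:  +87.0°C  (high = +85.0°C, crit = +105.0°C)".
# The degree sign is optional because it can be missing in some locales.
CORE_TEMP = re.compile(r"^(Core \d+):\s+([+-]?\d+(?:\.\d+)?)\s*°?C", re.MULTILINE)


def parse_core_temps(sensors_output):
    """Return {core_name: temperature in degrees Celsius}."""
    return {name: float(value) for name, value in CORE_TEMP.findall(sensors_output)}


def cores_above(temps, limit_c=85.0):
    """Return the sorted names of cores hotter than limit_c."""
    return sorted(name for name, value in temps.items() if value > limit_c)


def read_sensors():
    """Run `sensors` and return its text output (needs lm_sensors installed)."""
    result = subprocess.run(
        ["sensors"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return result.stdout


# Hypothetical sample output, for illustration only.
SENSORS_SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0:        +87.0°C  (high = +85.0°C, crit = +105.0°C)
Core 1:        +84.0°C  (high = +85.0°C, crit = +105.0°C)
"""

temps = parse_core_temps(SENSORS_SAMPLE)
print(temps, cores_above(temps))
```

On the real machine, replace SENSORS_SAMPLE with read_sensors() to check the live readings against the 85 °C threshold.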
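4. Solutions
1. Wait a while for the temperature to drop; the machine may then reboot normally.
2. Replace the device with a new one.
3. Add heat-sink fins and an 8500 RPM fan to the industrial PC.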