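The system runs Ubuntu 20.04.6 on a Supermicro X9DRi-F dual-socket motherboard with two E5-2600 v2 series CPUs and every memory slot populated. The machine is mainly used to run deep learning models, and it frequently freezes during training. After enabling mcelog, the following errors were reported: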
[49150.466577] mce: [Hardware Error]: Machine check events logged
[49150.466591] mce: [Hardware Error]: Machine check events logged
Hardware event. This is not a software error.
MCE 0
CPU 12 BANK 9 TSC 49c688083e1dd
MISC d221010001000c8c
TIME 1686062534 Tue Jun 6 22:42:14 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
MemCtrl: Corrected memory read error
STATUS c800008600800090 MCGSTATUS 0
MCGCAP 1000c1d APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 0
CPU 12 BANK 9 TSC 49c689e9ed41b
MISC d221010001000c8c
TIME 1686062534 Tue Jun 6 22:42:14 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
MemCtrl: Corrected memory read error
STATUS c800008600800090 MCGSTATUS 0
MCGCAP 1000c1d APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
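To see where the decoded lines above come from, the STATUS word can be unpacked by hand. The following is a minimal shell sketch, assuming the architectural IA32_MCi_STATUS layout described in the Intel SDM (VAL bit 63, OVER bit 62, UC bit 61, MISCV bit 59, ADDRV bit 58, corrected error count in bits 52:38, MCA error code in bits 15:0) and the memory controller error code pattern 0000 0000 1MMM CCCC. The function name decode_mci_status is hypothetical and not part of mcelog, and the count field is only meaningful on CPUs that report corrected error counts.

```shell
#!/bin/sh
# Sketch: decode selected fields of an IA32_MCi_STATUS value as printed by mcelog.
# decode_mci_status is an illustrative helper, not an mcelog command.
# The body runs in a subshell "( ... )" so its variables stay local and
# "exit" only ends the function, not the calling shell.
decode_mci_status() (
  s=${1#0x}
  case $s in
    ''|*[!0-9a-fA-F]*) echo "not a hex value: $1" >&2; exit 1 ;;
  esac
  [ "${#s}" -le 16 ] || { echo "more than 64 bits: $1" >&2; exit 1; }
  # Left-pad to 16 hex digits, then split into two 32-bit halves so the
  # arithmetic never depends on how the shell treats values above 2^63.
  while [ "${#s}" -lt 16 ]; do s="0$s"; done
  hi=$(( 0x$(printf '%s' "$s" | cut -c1-8) ))
  lo=$(( 0x$(printf '%s' "$s" | cut -c9-16) ))

  echo "valid=$(( (hi >> 31) & 1 ))"                # bit 63 VAL
  echo "overflow=$(( (hi >> 30) & 1 ))"             # bit 62 OVER
  echo "uncorrected=$(( (hi >> 29) & 1 ))"          # bit 61 UC
  echo "misc_valid=$(( (hi >> 27) & 1 ))"           # bit 59 MISCV
  echo "addr_valid=$(( (hi >> 26) & 1 ))"           # bit 58 ADDRV
  echo "corrected_count=$(( (hi >> 6) & 0x7fff ))"  # bits 52:38

  code=$(( lo & 0xffff ))                           # bits 15:0 MCA error code
  printf 'mca_code=0x%04x\n' "$code"

  # Memory controller errors follow the pattern 0000 0000 1MMM CCCC.
  if [ $(( code & 0xff80 )) -eq 128 ]; then
    case $(( (code >> 4) & 7 )) in
      0) op=generic ;;
      1) op=read ;;
      2) op=write ;;
      3) op=address/command ;;
      4) op=scrub ;;
      *) op=reserved ;;
    esac
    ch=$(( code & 0xf ))
    [ "$ch" -eq 15 ] && ch=unspecified
    echo "memctrl_op=$op"
    echo "memctrl_channel=$ch"
  fi
)
```

Calling decode_mci_status c800008600800090 with the STATUS value from the log reports overflow=1, uncorrected=0, misc_valid=1, corrected_count=2, memctrl_op=read and memctrl_channel=0, which agrees with the "Error overflow", "Corrected error", "MCi_MISC register valid" and RD_CHANNEL0_ERR lines printed by mcelog.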
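A few things stand out here, and it looks like a memory problem. After a lot of searching I could not find a way to map CPU 12 BANK 9 to a physical location. In the end, with ChatGPT's help, I found the following command:
sudo lshw -class memory # list all memory devices in the system, including the slot each memory module sits in
Running it on my machine gives: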
*-memory
description: System Memory
physical id: 2d
slot: System board or motherboard
size: 208GiB
capabilities: ecc
configuration: errordetection=multi-bit-ecc
*-bank:0
.......................
*-bank:9
description: DIMM Synchronous [empty]
product: Dimm1_PartNum
vendor: Dimm1_Manufacturer
physical id: 9
serial: Dimm1_SerNum
slot: P2_DIMME2
width: 64 bits
..............................................
*-bank:14
*
*-bank:15
*
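Here bank:9 corresponds to the E2 memory slot on my machine (P2_DIMME2), so the memory module there is probably faulty and should be pulled or replaced. Note that BANK in the mcelog output is the index of a machine-check register bank on the CPU, so matching it to lshw's bank:9 is a working guess to confirm by swapping modules rather than a guaranteed mapping.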
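Scrolling through the full lshw output to find one bank is error-prone, so the lookup can be condensed into one row per bank. The sketch below assumes the lshw text layout shown above ("*-bank:N" node headers followed by "description:" and "slot:" lines); lshw_bank_slots is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `lshw -class memory` output on stdin and print one line per
# *-bank node as "bank<TAB>slot<TAB>description".
# lshw_bank_slots is an illustrative name, not an lshw option.
lshw_bank_slots() {
  awk '
    function emit() {
      if (inbank) printf "%s\t%s\t%s\n", bank, slot, desc
      inbank = 0; bank = ""; slot = ""; desc = ""
    }
    function value(line) {            # strip the leading "key: " part
      sub(/^[ \t]*[a-z ]+:[ \t]*/, "", line)
      return line
    }
    /^[ \t]*\*-/ {                    # any new lshw node closes the previous one
      emit()
      if ($0 ~ /^[ \t]*\*-bank/) {
        inbank = 1
        bank = $0
        sub(/^[ \t]*\*-bank:?/, "", bank)
      }
      next
    }
    inbank && /^[ \t]*slot:/        { slot = value($0) }
    inbank && /^[ \t]*description:/ { desc = value($0) }
    END { emit() }
  '
}
```

A typical call would be sudo lshw -class memory | lshw_bank_slots. An entry whose description contains [empty], as bank:9 does in the output above, is one that lshw reports as unpopulated, which is worth checking against the board itself before pulling a module.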
Installing mcelog
Some blogs say to install it directly with
sudo apt-get install mcelog
but that did not work on my system.
One of the methods given on the official site:
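git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git # download the sources to the local machine
cd mcelog # the remaining steps run inside the cloned directory
make # build mcelog
sudo make install # install the mcelog binary
sudo cp mcelog.service /usr/lib/systemd/system # put the mcelog service file in the systemd unit directory
sudo systemctl enable mcelog.service # start at boot
sudo systemctl start mcelog.service # start now; if this fails, the error output is in /var/log/syslog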
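If the service does not start, the relevant messages can be pulled out of the log file mentioned above instead of reading it by hand. This is a small sketch, assuming a plain-text syslog at /var/log/syslog; mcelog_log_lines is a hypothetical helper name, and the file may need root or adm group membership to read.

```shell
#!/bin/sh
# Sketch: print the last few mcelog-related lines from a syslog-style file.
# mcelog_log_lines is an illustrative name; the default path is the one
# mentioned in the text above.
#   $1  log file to search (default: /var/log/syslog)
#   $2  number of matching lines to show (default: 20)
mcelog_log_lines() {
  file=${1:-/var/log/syslog}
  count=${2:-20}
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
  # Case-insensitive match, keeping only the most recent entries.
  grep -i 'mcelog' "$file" | tail -n "$count"
}
```

Running mcelog_log_lines with no arguments right after the systemctl start step shows the latest entries that mention mcelog, including any startup error.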