Purpose
dmesg reports memory ECC errors.
Identify which memory module is faulty.
dmesg output
[ 4.745351] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 4.745359] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5.746989] EDAC MC0: 27609 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 (channel:1 slot:0 page:0x105649c offset:0x6c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 channel_mask:2 rank:1)
[ 5.747001] EDAC MC0: 23245 CE memory scrubbing error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x105649e offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c3 socket:0 channel_mask:1 rank:1)
[ 300.644412] mce: [Hardware Error]: Machine check events logged
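The corrected-error (CE) lines above already contain the controller, the error count, and the DIMM label. A minimal sketch for pulling out just those three fields follows; the function name edac_ce_summary is hypothetical, and the pattern assumes the message layout shown in this log (other kernel versions may word it differently).

```shell
# Summarize EDAC corrected-error (CE) lines from kernel log text on stdin.
# Prints one line per match: <controller> <count> <DIMM label>
# ASSUMPTION: messages look like "EDAC MC0: <count> CE ... on <label> (...)".
edac_ce_summary() {
    sed -nE 's/.*EDAC (MC[0-9]+): ([0-9]+) CE .* on ([^ ]+) \(.*/\1 \2 \3/p'
}

# Usage on a live system (reading the kernel log may require root):
#   dmesg | edac_ce_summary
```

Lines such as "EDAC sbridge MC0: HANDLING MCE MEMORY ERROR" carry no count or label and are skipped by the pattern.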
Collect the memory error counts
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:23245 <- ECC errors: DIMM 0, channel 0, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:27609 <- ECC errors: DIMM 0, channel 1, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch3_ce_count:0
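On a machine with many channels most counters are zero, so it can help to print only the non-zero ones. The sketch below does that; the function name nonzero_ce_counts is hypothetical, and the optional argument (a base directory) exists only so the function can be tried against a copy of the sysfs tree.

```shell
# Print only the EDAC corrected-error counters that are greater than zero.
# $1 (optional): base directory; defaults to the sysfs path used above.
nonzero_ce_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        # Skip the unexpanded glob pattern and unreadable files.
        [ -r "$f" ] || continue
        n=$(cat "$f")
        if [ "$n" -gt 0 ] 2>/dev/null; then
            printf '%s:%s\n' "$f" "$n"
        fi
    done
    return 0
}

# Usage:
#   nonzero_ce_counts
```

Run against the counters listed above, this would leave only the two mc0/csrow0 entries (ch0 and ch1).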
Reference documentation
# yum install -y kernel-doc
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* updates: mirrors.sh.vclound.com
Resolving Dependencies
--> Running transaction check
---> Package kernel-doc.noarch 0:3.10.0-514.6.2.el7 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==============================================================================================================================
Package Arch Version Repository Size
==============================================================================================================================
Installing:
kernel-doc noarch 3.10.0-514.6.2.el7 updates 15 M
Transaction Summary
==============================================================================================================================
Install 1 Package
Total download size: 15 M
Installed size: 48 M
Downloading packages:
kernel-doc-3.10.0-514.6.2.el7.noarch.rpm | 15 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : kernel-doc-3.10.0-514.6.2.el7.noarch 1/1
Verifying : kernel-doc-3.10.0-514.6.2.el7.noarch 1/1
Installed:
kernel-doc.noarch 0:3.10.0-514.6.2.el7
Complete!
See the following excerpt on how csrows and channels map to DIMMs:
vim /usr/share/doc/kernel-doc-3.10.0/Documentation/edac.txt
            Channel 0   Channel 1
===================================
csrow0    | DIMM_A0   | DIMM_B0 |
csrow1    | DIMM_A0   | DIMM_B0 |
===================================

===================================
csrow2    | DIMM_A1   | DIMM_B1 |
csrow3    | DIMM_A1   | DIMM_B1 |
===================================
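From the above, the memory is split into two groups (memory controllers), mc0 and mc1.
In group mc0, the first and second modules report corrected ECC errors,
that is, DIMM 0 on channel 0 and DIMM 0 on channel 1.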
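The DIMM label in the EDAC message can be rewritten into the "BRANCH b CHANNEL c DIMM d" form that dmidecode prints as Bank Locator. The sketch below is only an illustration: the function name edac_label_to_bank_locator is hypothetical, and it assumes that the SrcID (socket) number in the label equals the BRANCH number, which holds for the output shown in these notes but is board-specific.

```shell
# Convert an EDAC DIMM label, e.g. CPU_SrcID#0_Channel#1_DIMM#0,
# into the "BRANCH b CHANNEL c DIMM d" form.
# ASSUMPTION: SrcID (socket) number == BRANCH number on this board.
edac_label_to_bank_locator() {
    printf '%s\n' "$1" |
        sed -nE 's/^CPU_SrcID#([0-9]+)_Channel#([0-9]+)_DIMM#([0-9]+)$/BRANCH \1 CHANNEL \2 DIMM \3/p'
}

# Usage:
#   edac_label_to_bank_locator 'CPU_SrcID#0_Channel#1_DIMM#0'
```

A label that does not follow this naming scheme produces no output, so an empty result signals that the mapping has to be checked by hand.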
Find the physical memory slot
dmidecode -t memory | grep -E 'Memory Device|Size:|Locator'
Memory Device
Size: 16384 MB
Locator: DIMM000
Bank Locator: BRANCH 0 CHANNEL 0 DIMM 0 <- faulty
Memory Device
Size: No Module Installed
Locator: DIMM001
Bank Locator: BRANCH 0 CHANNEL 0 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM010
Bank Locator: BRANCH 0 CHANNEL 1 DIMM 0 <- faulty
Memory Device
Size: No Module Installed
Locator: DIMM011
Bank Locator: BRANCH 0 CHANNEL 1 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM020
Bank Locator: BRANCH 0 CHANNEL 2 DIMM 0
Memory Device
Size: No Module Installed
Locator: DIMM021
Bank Locator: BRANCH 0 CHANNEL 2 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM030
Bank Locator: BRANCH 0 CHANNEL 3 DIMM 0
Memory Device
Size: No Module Installed
Locator: DIMM031
Bank Locator: BRANCH 0 CHANNEL 3 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM100
Bank Locator: BRANCH 1 CHANNEL 0 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM101
Bank Locator: BRANCH 1 CHANNEL 0 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM110
Bank Locator: BRANCH 1 CHANNEL 1 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM111
Bank Locator: BRANCH 1 CHANNEL 1 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM120
Bank Locator: BRANCH 1 CHANNEL 2 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM121
Bank Locator: BRANCH 1 CHANNEL 2 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM130
Bank Locator: BRANCH 1 CHANNEL 3 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM131
Bank Locator: BRANCH 1 CHANNEL 3 DIMM 1
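Instead of scanning the listing by eye, the slot label for one Bank Locator can be looked up with a small awk filter. This is a sketch, not a general dmidecode parser: the function name slot_for_bank_locator is hypothetical, and it assumes that Size and Locator appear before Bank Locator within each Memory Device record, as they do in the listing above.

```shell
# Print "<Locator> (<Size>)" for the memory device whose Bank Locator
# equals $1. Reads `dmidecode -t memory` output on stdin.
# ASSUMPTION: Size and Locator precede Bank Locator in each record.
slot_for_bank_locator() {
    awk -v want="$1" '
        /Memory Device/      { size = ""; loc = "" }
        /^[ \t]*Size:/       { sub(/^[ \t]*Size:[ \t]*/, ""); size = $0 }
        /^[ \t]*Locator:/    { sub(/^[ \t]*Locator:[ \t]*/, ""); loc = $0 }
        /^[ \t]*Bank Locator:/ {
            sub(/^[ \t]*Bank Locator:[ \t]*/, "")
            sub(/[ \t]+$/, "")            # drop trailing whitespace
            if ($0 == want) print loc " (" size ")"
        }
    '
}

# Usage (dmidecode requires root):
#   dmidecode -t memory | slot_for_bank_locator 'BRANCH 0 CHANNEL 1 DIMM 0'
```

For the two faulty positions identified in these notes, the listing above gives the slot labels DIMM000 and DIMM010.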