内存故障分析

随着虚拟化,Redis,BDB内存数据库等应用的普及,现在越来越多的服务器配置了大容量内存,拿DELL的R620来说在配置双路CPU下,其24个内存插槽,支持的内存高达960GB。对于ECC,REG这些带有纠错功能的内存故障检测是一件很头疼的事情,出现故障,还是可以连续运行几个月甚至几年,但如果运气不好,随时都会挂掉,好在linux中提供了一个edac-utils 内存纠错诊断工具,可以用来检查服务器内存潜在的故障。

服务器的硬件架构

下面以CentOS为例,介绍下edac-utils 工具的使用。 在使用edac-utils 工具之前,需要先了解服务器的硬件架构,以DELL R620为例,(其它如HP DL360P G8,IBM X3650 M4 机型都使用了 E5-2600 系列CPU,C600 系列芯片组.大致相同) 其CPU内存控制器对应通道,内存槽关系,如下所示:

处理器0 (对应一个内存控制器)
通道0:内存插槽A1、A5 和A9
通道1:内存插槽A2、A6 和A10
通道2:内存插槽A3、A7 和A11
通道3:内存插槽A4、A8 和A12

处理器1 (对应一个内存控制器)
通道0:内存插槽B1、B5 和B9
通道1:内存插槽B2、B6 和B10
通道2:内存插槽B3、B7 和B11
通道3:内存插槽B4、B8 和B12

安装edac-utils工具

yum install -y libsysfs edac-utils

执行检测命令

通过检测命令可查看纠错提示信息:

[root@md1 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 627 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors

其中:

  • mc0 表示 表示内存控制器0,
  • CPU_Src_ID#0表示源CPU0 ,
  • Channel#0 表示通道0
  • DIMM#0 标示内存槽0,
  • Corrected Errors 代表已经纠错的次数,

内存对应关系

4x16关系
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h

12条内存的对应关系
mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

20条内存的对应关系
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors 
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors 
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

根据前面列出的CPU通 道和内存槽对应关系即可给edac-utils 返回的信息进行编号。 即可得出A4内存出现潜在故障,接下来联系供应商进行更换即可。

参考:
http://www.cokll.com/archives/14/
http://server.51cto.com/News-568227.htm 服务器常见故障的判断与维修

转载于:https://my.oschina.net/adailinux/blog/1837514

  • 1
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值