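2015-12-30, afternoon: Nagios monitoring raised a disk-full alert.
2015-12-31, morning: started investigating the cause.
The investigation found three system log files under the log directory that had grown very large, as much as 8.7G.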
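One way to spot oversized files in a log directory is to rank its entries by disk usage. A minimal sketch: largest_entries is a hypothetical helper name, and the /var/log default is an assumption taken from the shell prompt that appears later in these notes.

```shell
# Print the N largest entries directly under DIR, biggest first (sizes in KiB).
# Usage: largest_entries [DIR] [N]
# The defaults (/var/log, 5) are assumptions for illustration.
largest_entries() {
    local dir="${1:-/var/log}"
    local n="${2:-5}"
    # du -sk prints one summarized line per entry, in 1024-byte units;
    # sort -rn puts the largest first, head keeps the top N.
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}
```

Running `largest_entries /var/log 3` would list the three biggest files or subdirectories, each as a size in KiB followed by a tab and the path.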
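Reading the logs showed a huge number of EDAC DIMM CE Error entries.
Some Googling showed that these come from faulty memory. CE stands for correctable error: the error is corrected and logged, the same DIMM fails again, and the cycle repeats, which is what made the log files balloon.
I then skimmed the EDAC document in the Linux kernel documentation.
The relevant passage is: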
Dual channels allows for 128 bit data transfers to the CPU from memory.
Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
(FB-DIMMs). The following example will assume 2 channels:
            Channel 0   Channel 1
    ===================================
    csrow0  | DIMM_A0  | DIMM_B0 |
    csrow1  | DIMM_A0  | DIMM_B0 |
    ===================================

    ===================================
    csrow2  | DIMM_A1  | DIMM_B1 |
    csrow3  | DIMM_A1  | DIMM_B1 |
    ===================================
So I ran the following on the machine:
root@ubuntu:/var/log# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:4213901959
According to the layout quoted above, csrow2 on channel 0 corresponds to DIMM_A1, so DIMM_A1 is the faulty module.
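The grep pattern above matches every counter, zeros included, because every value contains a digit. A minimal sketch that prints only the non-zero counters, assuming the csrow layout under /sys/devices/system/edac/mc seen in the output above; edac_nonzero_ce is a hypothetical helper name, not an existing tool.

```shell
# List EDAC correctable-error (CE) counters whose value is not zero.
# Usage: edac_nonzero_ce [BASE_DIR]
# BASE_DIR defaults to the sysfs path used in the grep above.
edac_nonzero_ce() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local f count
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        # Skip the unexpanded pattern (no match) and unreadable files.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        if [ "$count" != "0" ]; then
            # Print the path relative to BASE_DIR, then the counter value.
            printf '%s:%s\n' "${f#"$base"/}" "$count"
        fi
    done
}
```

Called without arguments it reads the live sysfs tree; passing another directory as the first argument runs it against a saved copy. Which csrow and channel map to which physical slot still has to be read off the board's layout, as done above.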
Next I ran
root@ubuntu:/var/log# dmidecode -t memory
and found the entry for DIMM_A1 in the output:
Memory Device
    Array Handle: 0x0032
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_A1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: Other
    Speed: 1333 MHz
    Manufacturer: Manufacturer0
    Serial Number: SerNum1
    Asset Tag: AssetTagNum1
    Part Number: PartNum1
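Follow-up:
- To keep logs from filling the disk again, I changed the logrotate configuration: a shorter rotation period, fewer retained rotations, and compression enabled for rotated logs.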
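The notes do not record the actual logrotate settings, so the following is only a sketch of what such a change could look like. logrotate_stanza is a hypothetical helper that prints a stanza for the log paths it is given; daily, rotate 3 and compress are illustrative values. On a distribution that already ships a stanza for these logs, the directives would be merged into that stanza rather than added as a second entry for the same file.

```shell
# Print a logrotate stanza for the given log files.
# Usage: logrotate_stanza LOGFILE...
# The directive values below are illustrative assumptions, not the
# settings used in the incident.
logrotate_stanza() {
    if [ "$#" -eq 0 ]; then
        echo "usage: logrotate_stanza LOGFILE..." >&2
        return 1
    fi
    local path
    for path in "$@"; do
        printf '%s\n' "$path"
    done
    cat <<'EOF'
{
    # shorter rotation period
    daily
    # fewer rotated copies kept
    rotate 3
    # gzip rotated copies
    compress
    missingok
    notifempty
    # keep the existing postrotate ... endscript block here, so the
    # syslog daemon reopens its log files after rotation
}
EOF
}
```

For example, `logrotate_stanza /var/log/syslog /var/log/kern.log` prints a stanza covering both files (the file names are assumptions). After editing the real configuration, `logrotate -d` on that file does a dry run that reports what would be rotated without changing anything.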